Abstract



This thesis introduces PEMS2, an improvement to PEMS (Parallel External Memory 
System). PEMS executes Bulk-Synchronous Parallel (BSP) algorithms in an Exter- 
nal Memory (EM) context, enabling computation with very large data sets which 
exceed the size of main memory. Many parallel algorithms have been designed and 
implemented for Bulk-Synchronous Parallel models of computation. Such algorithms 
generally assume that the entire data set is stored in main memory at once. PEMS 
overcomes this limitation without requiring any modification to the algorithm by using 
disk space as memory for additional "virtual processors". Previous work has shown 
this to be a promising approach which scales well as computational resources (i.e. 
processors and disks) are added. However, the technique incurs significant overhead 
when compared with purpose-built EM algorithms. PEMS2 introduces refinements to 
the simulation process intended to reduce this overhead as well as the amount of disk 
space required to run the simulation. New functionality is also introduced, including 
asynchronous I/O and support for multi-core processors. Experimental results show 
that these changes significantly improve the runtime of the simulation. PEMS2 nar- 
rows the performance gap between simulated BSP algorithms and their hand-crafted 
EM counterparts, providing a practical system for using BSP algorithms with data 
sets which exceed the size of RAM. 
Chapter 1



Introduction 

1.1 Background and Motivation 

External Memory (EM) algorithms are designed to work with data sets much larger 
than main memory. Though strictly defined in more general terms, EM models 
typically consider a 2-tier memory hierarchy: "main memory" (RAM) and "external 
memory" (disk). Algorithms designed for these models explicitly transfer blocks of 
data between these levels of memory, attempting to minimize the number of transfers 
between them. In addition to minimizing data transfer, EM algorithms may also be 
designed to access external memory (e.g. disk) in an efficient pattern to minimize 
expensive disk seeking. 

Unfortunately, most algorithms are designed for Random Access Memory (RAM) 
models rather than EM. RAM algorithms work with a single level of memory, and 
assume a read or write at any location has a fixed constant cost. Because RAM 
algorithms do not consider locality of reference a performance factor, translating a 
RAM algorithm into an EM algorithm with acceptable performance is not simple or 
automatable in the general case. There are, however, certain classes of algorithms 
which can work well in an EM context despite not being designed with EM specifically 
in mind. 

The goal of this thesis is to enable the practical use of such algorithms on problems 
that exceed the size of RAM, allowing an algorithm to scale beyond the limits of main
memory without requiring a complete rewrite. Parallel algorithms are particularly
desirable in this context since very large problems may exceed the resources of a single 
machine and sequential computation with data of this magnitude in reasonable time 
is generally not feasible. 
1.2 Computational Models

1.2.1 Parallel Disk Model (PDM) 

The multiple disk model originally proposed by Vitter and Shriver [19][20], usually
referred to as the PDM model, is commonly used for designing disk-based algorithms.
In PDM, an algorithm has access to an internal random access memory of size M,
and D disks which transfer in blocks of size B. I/O is fully parallel and blocked, i.e. a
transfer of size BD (to D disks) is considered a single I/O operation. The complexity
of an algorithm is measured exclusively in terms of the number of such I/O operations,
ignoring other factors such as computation time. This reflects the reality that disk
access is orders of magnitude more expensive than RAM access. Computation time
of an algorithm may also be given for algorithms with especially high computational
requirements, though typically I/O time dwarfs computation time by a large enough
margin that computation does not significantly contribute to the total run time. 
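
As a small illustration of this cost measure, the following sketch counts the PDM I/O operations needed to read a data set once from D disks. The function name and the example values are assumptions made for illustration only; they are not taken from the PDM literature or from PEMS.

```python
def pdm_scan_ios(n_items: int, block_size: int, num_disks: int) -> int:
    """Number of parallel I/O operations needed to read n_items once.

    One PDM I/O operation moves one block of block_size items on each of
    the num_disks disks, i.e. up to block_size * num_disks items in total.
    """
    if n_items < 0 or block_size <= 0 or num_disks <= 0:
        raise ValueError("n_items must be >= 0; block_size and num_disks > 0")
    per_operation = block_size * num_disks
    # Integer ceiling division: a partially filled final transfer still
    # counts as one full I/O operation.
    return -(-n_items // per_operation)


# Hypothetical example: 1000 items, B = 10, D = 4 disks.
scan_cost = pdm_scan_ios(1000, 10, 4)
```

Doubling D halves the count in this sketch, which is the parallel-disk speedup the model is designed to capture.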

1.2.2 Bulk Synchronous Parallel (BSP) and Related Models 

The BSP model [18] was proposed as an abstract "bridging model for parallel computation".
BSP serves as a common model for both system/hardware and algorithm/software
designers which allows for accurate performance analysis on a wide
range of parallel computers. BSP considers a set of processors each with independent 
local memory that communicate by sending messages between each other. Computation
proceeds in a series of synchronised "supersteps", each of which consists of a
"computation superstep" followed by a "communication superstep". The total runtime
of an algorithm is thus the sum of the computation time, communication time,
and synchronisation time. 

A superstep where each processor sends and receives $O(h)$ data is called an "h-relation".
BSP* and Coarse Grained Multicomputer (CGM) [S], other common
models of parallel computation, are special cases of BSP with restrictions on h to 
ensure a more coarse grained computation. In practical terms BSP* and CGM algo- 
rithms proceed in an identical fashion to BSP algorithms, i.e. in a series of supersteps. 
Accordingly, the three are considered equivalent for much of this thesis and collectively
referred to as "BSP-like".
Though all BSP-like models function similarly, the performance characteristics of
restricted models have important implications when used with EM. CGM requires 
that each processor works with $O(N/p)$ local data (i.e. $h = O(N/p)$), where N is the problem size and p the number of processors. This ensures
balanced computation and communication with coarse granularity. Synchronisation 
overhead is thus minimized, while processor and disk parallelism is exploited effi- 
ciently. Since communication and synchronisation in PEMS is relatively expensive 
due to disk I/O, these characteristics are especially desirable in this context. The 
applications presented in Chapter [8] are CGM algorithms. 

BSP-like algorithms are useful on a wide variety of configurations, particularly 
the common and inexpensive "cluster" style of parallel computer composed of several 
commodity machines connected by a switched Ethernet network. 

Figure 1.1: BSP-like Model (real processors, each with CPU and RAM, connected by a network)



1.2.3 EM-BSP Models 

The parallel and distributed memory nature of BSP-like algorithms is advantageous 
from an EM perspective since, as in PDM, a collection of parallel disks can perform 
I/O much faster than a single disk. The EM-BSP, EM-BSP*, and EM-CGM models 
] [B] [Tj augment the corresponding BSP-like model by adding local disk(s) to each
machine. Such a configuration is shown in Fig. 1.2.

Computation proceeds in supersteps as in BSP, except each processor may access 
local disk as necessary during the computation superstep. Thus, the EM-BSP models
can be considered a hybrid of the BSP-like models and the PDM model: synchronisation
and communication are inherited from BSP, and I/O from PDM.



Figure 1.2: EM-BSP Model (real processors, each with CPU, RAM, and local disks, connected by a network; shown with D = 2 disks)



1.3 Previous Work 
1.3.1 STXXL 

STXXL is a C++ library for EM algorithms. STXXL is composed of many layers, as
shown in Fig. 1.3 (reproduced from [10]). Higher level layers in STXXL make use of
the lower level layers, though user applications may directly use any layer, bypassing 
higher level functionality if desired. 

The lower level Block Management and Asynchronous I/O Primitive layers pro- 
vide generic functionality useful to EM algorithms, such as asynchronous I/O and 
transparent parallel disk access. 

The STL User Layer provides an implementation of the C++ Standard Template 
Library (STL), the algorithms and data structures component of the C++ standard 
library. This layer can be used to write C++ code in the standard style that functions 
as an EM algorithm, or simplify the porting of existing C++ RAM algorithms to EM. 

The Streaming Layer provides additional functionality that does not fit within the 
confines of the STL API. EM-specific techniques such as pipelining and I/O optimal 
scanning are implemented in this layer. 
Figure 1.3: STXXL Design (reproduced from [10]). Layers, from top to bottom:

Applications
STL User Layer: Containers, Algorithms
Streaming Layer: Pipelined Sorting, Scanning
Block Management Layer: Typed Block, Block Manager, Buffered Streams, Block Prefetcher, Buffered Block Writer
Asynchronous I/O Primitives Layer: Files, I/O Requests, Disk Queues, Completion Handlers
Operating System



STXXL provides a rich suite of EM code, making it simple to write advanced EM 
algorithms at a relatively high level. Notably, all layers above and including the Block 
Management Layer transparently support parallel disks. Thus, applications built with 
STXXL can take advantage of parallel disk performance without any specific effort 
required on behalf of the application developer. The Asynchronous I/O Primitives 
layer provides a simple, low-level, and portable interface to asynchronous I/O. Since 
the asynchronous I/O interface of operating systems is typically more complex and 
varies between systems, this layer is useful to applications that require asynchronous 
I/O but not the higher level functionality of STXXL. 



1.3.2 Cache-Oblivious Algorithms 

Cache-Oblivious algorithms [12] are designed with I/O efficiency in mind (unlike RAM 
algorithms), but without any explicit block size parameters (unlike EM algorithms). 
For example, traditional EM algorithms explicitly transfer blocks of some size B 
between disk and main memory. The algorithm implementation must know the value 
of B at run time. In contrast, a cache-oblivious algorithm is unaware of (or oblivious 
to) any such parameter, and may transfer data with arbitrary size and alignment
much like a RAM algorithm. However, unlike most RAM algorithms, cache-oblivious
algorithms are analysed in terms of memory transfers of an arbitrary size B, and aim
to minimize the number of transfers much like an EM algorithm. Thus, an efficient
cache-oblivious algorithm is efficient for any B and does not require modification to 
perform well on various systems. 
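
To make the idea concrete, the sketch below shows a well-known cache-oblivious technique, recursive matrix transposition by halving the larger dimension. It is not taken from this thesis or from PEMS, and the function name and index convention are assumptions for illustration. Note that no block size B appears anywhere in the code: the recursion eventually produces sub-problems that fit in a block of any size.

```python
def co_transpose(a, b, r0, r1, c0, c1):
    """Write the transpose of a[r0:r1][c0:c1] into b.

    The larger dimension is halved recursively until a single element
    remains, so locality improves at every level of the memory hierarchy
    without the code knowing any block size.
    """
    rows, cols = r1 - r0, c1 - c0
    if rows <= 0 or cols <= 0:
        return
    if rows == 1 and cols == 1:
        b[c0][r0] = a[r0][c0]
    elif rows >= cols:
        mid = r0 + rows // 2          # split the row range
        co_transpose(a, b, r0, mid, c0, c1)
        co_transpose(a, b, mid, r1, c0, c1)
    else:
        mid = c0 + cols // 2          # split the column range
        co_transpose(a, b, r0, r1, c0, mid)
        co_transpose(a, b, r0, r1, mid, c1)


# Hypothetical 2 x 3 example; `transposed` receives the 3 x 2 result.
matrix = [[1, 2, 3], [4, 5, 6]]
transposed = [[0] * 2 for _ in range(3)]
co_transpose(matrix, transposed, 0, 2, 0, 3)
```

A practical implementation would stop the recursion at a small base case and use a plain loop there; the single-element base case is kept only to keep the sketch short.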

This approach is particularly useful in the presence of cache hierarchies, where 
many levels of cache are in use at one time, each with a different block size. I/O 
efficiency is an increasingly important performance factor, even for algorithms that 
work only with internal memory (RAM). On modern systems, a cache miss can be 
several hundred times slower than a cache hit [14]. Cache-oblivious literature often
presents this problem in the context of a modern processor's cache and memory 
hierarchy, though the block size independent nature naturally applies where the lowest 
level of the memory hierarchy is disk. This suggests cache-oblivious algorithms are a 
promising strategy for the design of algorithms that show good performance across a 
very wide range of problem sizes.



1.3.3 MPI 

MPI (Message Passing Interface) [11] is an Application Programming Interface (API) 
for distributed memory parallel programming. MPI provides communication and 
synchronisation functions useful for many types of parallel program. Most relevant 
to this thesis are the "collective communication" MPI functions, since these can be 
used to implement BSP-like algorithms. 

Collective communication functions in MPI synchronise all processors, then per- 
form communication. There are many different styles of communication available, 
such as MPI_Gather (each processor sends a message to a single processor) or MPI_Alltoall 
(each processor sends a message to every other processor).

In a BSP-like program implemented with MPI, a call to a collective communication 
represents a communication superstep and subsequent superstep barrier. In this way,
a BSP-like algorithm can be implemented as a series of MPI collective communication
calls interleaved with computation code.

^Cache-oblivious literature typically uses L (for "line"), rather than B. This thesis consistently
uses B and "block" regardless of whether cache or disk is being discussed.

^A more detailed description of the collective communication functions described here can be
found in Chapter [7].

MPI is a widely used interface for distributed memory parallel programming with 
many implementations for a variety of systems. 

1.3.4 EM-BSP Simulation 

Many parallel algorithms intended to work with large data sets have been designed 
for BSP-like models. Though these algorithms scale to larger data sets than single 
processor RAM algorithms by exploiting the memory available to several machines, 
unfortunately they do not generally make use of disk and are thus limited to problems 
that fit entirely within main memory. 

Fortunately, it is possible to use these existing algorithms with data larger than 
main memory via simulation in the EM-BSP model. The basic idea is to simulate
a number of "virtual processors", each with memory small enough to fit into "real 
processor" main memory. A subset of these virtual processors is executed at once, 
while the (virtual) memories of others are swapped out to disk. Thus it is possible 
to run a bulk-synchronous algorithm with total memory size exceeding that of real 
main memory, limited only by the amount of available disk space. 

To illustrate, consider a BSP-like algorithm that requires 128 processors, each 
with 1 GiB of RAM. If these resources are available, the algorithm may be executed 
directly. However, this is not the case if only 32 processors are available. Nevertheless, 
the algorithm may be executed using these limited resources via simulation as follows: 
for each superstep, rather than run 128 processes in parallel, run 32 processes in 
parallel, storing any generated messages on disk. Then, another round of 32 processes 
is executed in a similar fashion, and so on until all 128 processes have been executed. 
At the end of this process, all computation for the superstep has been completed and 
all communication is stored on disk, so the next superstep may begin. 
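
The round-based execution in this example can be sketched as follows. This is a simplified illustration, not the PEMS implementation: the function names are hypothetical, an in-memory dictionary stands in for the messages stored on disk, and the virtual processors of a round run one after another rather than in parallel.

```python
from collections import defaultdict


def simulate_superstep(v, k, compute, inboxes):
    """Run one superstep for v virtual processors, k at a time.

    compute(pid, received) returns a dict {destination: message}.
    The returned dict plays the role of the messages stored on disk.
    """
    disk = defaultdict(list)      # destination -> list of (source, message)
    rounds = 0
    for start in range(0, v, k):
        rounds += 1
        # In a real system the k processors of a round run in parallel.
        for pid in range(start, min(start + k, v)):
            outgoing = compute(pid, inboxes.get(pid, []))
            for dest, msg in outgoing.items():
                disk[dest].append((pid, msg))
    return dict(disk), rounds


def send_to_right_neighbour(pid, received, v=128):
    # Hypothetical computation step: pass the own id to the next processor.
    return {(pid + 1) % v: pid}


# 128 virtual processors on 32 real processors, as in the text.
inboxes, rounds = simulate_superstep(128, 32, send_to_right_neighbour, {})
```

With these values the superstep takes four rounds, after which every virtual processor has exactly one message waiting for the next superstep.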

In practice, this strategy can be implemented as a library which provides commu- 
nication functions for use by BSP-like applications. In particular, no special operating 
system level support is required. All details pertaining to external memory can be 
managed by this library; the application code need not be changed. 

^This idea was introduced with the original presentation of the EM-BSP models [13] [7] [6].
1.3.5 PEMS



PEMS1 [15] (Parallel External Memory System) is an implementation of the EM-BSP
simulation technique which provides an API similar to that of MPI. Fig. 1.4 shows
an overview of the PEMS1 design.



Figure 1.4: PEMS1 Design. Layers, from top to bottom:

Applications
PEMS API (public interface to PEMS for applications): EM_Init(), EM_Finalize(), EM_Abort(), EM_Comm_size(), EM_Comm_rank(), EM_Bcast(), EM_Scatter(), EM_Gather(), EM_Gatherv(), EM_Allgatherv(), EM_Alltoall(), EM_Alltoallv(), EM_Barrier(), EM_malloc(), EM_free()
UNIX IO (synchronous UNIX I/O implementation); MPI library; GNU Pth library
Operating System



Significant modifications to PEMS1 have been made as a part of this thesis. Where
the distinction is necessary the previous implementation is referred to as "PEMS1",
and this improved version as "PEMS2". Both are collectively referred to as "PEMS"
where appropriate.

PEMS1 is implemented as a library which transparently handles virtual processor
swapping, synchronisation, memory allocation, and communication. The user program
is an MPI-like program, but communication may be deferred to disk to allow
the simulation of more processors than are actually available.

The interface to PEMS1 is, with a few exceptions, semantically identical to a
subset of MPI, though function names have a different prefix to avoid conflicts.
When the application calls collective communication functions, PEMS1 internally
performs the necessary network or I/O operations, swapping virtual processors in
and out as required. Though much occurs "behind the scenes", from the application's
point of view the collective communication operation has been completed exactly as
if it had been performed directly by MPI.

Internally, the system's MPI library is used to perform communication between 
virtual processors on separate real processors. I/O is performed using the operating 
system's I/O interface; specifically that of POSIX, the standard common to all UNIX- 
like systems such as GNU/Linux, Solaris, or Mac OS X. 

Thread support in PEMS1 is handled via the GNU Pth library, which implements
user-space threads. This is advantageous for single-core processors since thread
switching does not incur the overhead of a kernel-level context switch. However, user-space
threads do not allow for true thread concurrency on multi-core machines.

PEMS1 has been shown to scale well in practice on sorting and list ranking problems
significantly larger than the total amount of available RAM [16]. The distinguishing
characteristic of PEMS is that existing algorithms not explicitly designed
as EM algorithms may be efficiently used with external memory. In addition to the 
large number of suitable (BSP-like) existing algorithms, a considerable advantage of 
this approach is the ability to exploit distributed memory parallel computers. While 
it is possible to implement distributed memory algorithms using STXXL or a cache- 
oblivious approach, the algorithm must be deliberately designed to have this ability 
- a significantly more difficult task than designing a sequential EM algorithm. Algo- 
rithms designed to support both distributed memory parallelism and external memory 
are relatively rare. In contrast, all algorithms that work with PEMS are inherently 
capable of executing on a distributed memory parallel computer. 

Because of this ability, PEMS can easily scale to extremely large problem sizes 
without requiring any modification to the algorithm. If, for example, one wanted to 
use a straightforward STXXL application on a problem too large to feasibly handle 
with a single computer, the algorithm may require a significant redesign in order to 
scale further. A PEMS application, however, can easily scale to very large problem
sizes by adding disk and/or processor resources as necessary.

^This has been resolved in PEMS2, see §1.4.

Similarly, if a given problem takes an unacceptable amount of time, processor 
and/or disk resources may be added to improve the run time. Experiments in Chapter [8]
and previous work on PEMS1 [T3] show that, though the STXXL sort is faster
than PEMS given equivalent computational resources, computation resources can be 
added until PEMS out-performs the STXXL sort. 

1.4 Summary of Contributions 

This thesis presents PEMS2, an enhanced version of PEMS1 with new functionality
and improved performance and usability. These enhancements include fundamental 
changes to the simulation process, such as new I/O drivers and multi-core support; 
as well as new communication primitives with improved performance characteristics. 

While PEMS1 supported parallelism across several machines in a cluster configuration,
SMP (or "multi-core") on each of these machines was not explicitly supported.
Though most commodity machines were single-core at the time, multi-core
has since become ubiquitous. PEMS2 introduces support for multi-core machines, allowing
the simulation to take advantage of multiple cores with less overhead than simply 
running several local MPI processes. The computation performed by the simulated 
algorithm is executed in parallel across many cores, allowing for speedup in compu- 
tation heavy algorithms. However, even for algorithms that are I/O bound, many of 
the improved communication primitives achieve an I/O reduction proportional to the 
number of local cores available. 

PEMS1 used explicit, blocking, aligned I/O operations exclusively. While the
improvements presented here can also work in the same fashion, the implementa- 
tion has been redesigned to allow simple switching between various I/O "drivers". 
In conjunction with other changes, this allows for the use of asynchronous I/O, or 
memory-mapped I/O, both of which have significantly different performance charac- 
teristics to traditional blocking I/O. In particular, memory-mapped I/O is interesting 
because a superstep does not necessarily incur a swap of the entire context as with 
explicit I/O. With memory mapped I/O the memory access characteristics of the sim- 
ulated algorithm dictate the I/O performed during simulation, allowing algorithms 
to take advantage of desirable memory access characteristics when used with PEMS. 
Experiments show that these I/O strategies are beneficial in some cases, but are not
always an improvement, depending on the nature of the simulated algorithm.

The most powerful communication primitive in PEMS1, Alltoallv, has been
redesigned to use a new message delivery strategy which avoids the need for an area 
on disk reserved for delivery. The new algorithm thus requires less I/O and disk space 
to perform the same task. In practice this also eases configuration since the user no 
longer needs to calculate the message volume of a given algorithm in order to allocate 
disk space to virtual processors. 

Several common collective communication primitives are merely restricted cases 
of Alltoallv, including all of the primitives implemented in PEMS1. However, these
can often be implemented much more efficiently than the equivalent call to Alltoallv. 
These primitives, as well as Alltoallv itself, have been optimised to eliminate any 
unnecessary swapping. In particular, with the introduction of multi-core support, 
"rooted" communication primitives can be implemented more efficiently using ap- 
propriate synchronisation techniques. "Rooted" communication primitives are those 
which send to or receive from some root virtual processor, as opposed to Alltoallv 
in which all processors communicate as equals. This thesis introduces a small set 
of thread synchronisation primitives that handle swapping in such cases, to ensure a 
minimal amount of I/O is performed. 

Also introduced in PEMS2 is a new type of collective communication function 
that performs communication as well as computation, unlike those implemented in 
PEMS1 which perform communication alone. Reduce, and similar methods, are
beneficial to certain algorithms since the system can perform the combined com- 
munication and computation more efficiently than a user program could by using 
communication primitives alone. These operations are defined by MPI and used in 
many BSP algorithms, expanding the useful scope of the implementation. 

In addition to these fundamental changes, the implementation has been thoroughly 
rewritten with the intention of being straightforward to use with existing or new 
MPI programs on any appropriate system. MPI programs can be compiled against 
PEMS2 without modification, making it straightforward to simulate any existing
MPI algorithm for problem sizes vastly exceeding the amount of available main
memory. All parameters of PEMS2 can be passed at run-time to the program through
command line arguments, simplifying automated or manual experimentation. An
integrated benchmarking system can record the overall run time of a simulation or a
fine-grained breakdown of run-time at each superstep. Benchmark results are written
to a gnuplot compatible file which can be used to generate plots like those in this
thesis. A comprehensive test suite adapted and augmented from several existing MPI
test suites ensures correctness at the application level for any configuration. PEMS2
is freely available on the web under an Open Source license and is straightforward
to compile and use on any UNIX system.

^Assuming, of course, that the program is restricted to the set of calls implemented by PEMS2.

1.5 Thesis Outline 

The notation, terminology, and variables used throughout are described in Appendix [B].
To more thoroughly introduce the reader to the context of this thesis, Chapter [2]
describes in detail the approach to EM-BSP simulation taken in PEMS1, and the limitations
which PEMS2 aims to address. Subsequent chapters describe how these
limitations are addressed in PEMS2: Chapter [3] gives a brief overview of the architecture
of PEMS2. Chapter [4] describes the modifications necessary to allow the
simulation to take advantage of multi-core processors, including the synchronisation
primitives referenced in later chapters. The various styles of I/O available in PEMS2
are described in Chapter [5]. The choice of I/O style does not affect the implementation
of communication algorithms, but may affect analysis; the consequences of this
choice are also discussed in Chapter [5]. Chapter [6] describes a new message delivery
strategy which differs significantly from that used in PEMS1, using the Alltoallv
operation as an example.

The communication algorithms presented in Chapter [7] make use of the material in 
preceding chapters to implement several new communication methods with improved 
performance characteristics. Chapter [8] then presents several applications built using
these methods along with theoretical and experimental performance analysis. This
data shows that the claims of improvement in previous chapters translate into "real-world"
application scenarios. 

Finally, Chapter [9] discusses conclusions that can be drawn from PEMS2 results, 
and suggests potential directions for future work. 



Chapter 2 



Overview of PEMS1
2.1 Overview 

For clarity this overview describes the case where a single real processor is used; the
strategy for simulation with multiple real processors is similar, and addressed in detail 
in later sections. 

PEMS1 implements EM-BSP simulation by assigning a thread to each simulated
virtual processor. Since there are v virtual processors, v threads exist simultaneously. 
However, only a single thread executes at a given time. 

The application, being a BSP-like algorithm, is a series of computation supersteps 
separated by calls to PEMS communication functions. These functions serve as both 
communication supersteps and superstep barriers. When one is called, PEMS performs
the necessary communication, then swaps the calling thread out of memory.
At this point there is a superstep barrier, so the thread yields and another thread is
swapped in, which will eventually call the same communication function and reach
the same barrier. Thus, all threads will eventually synchronise at this barrier and
the next superstep can begin. Several barriers may actually be used to implement 
a collective communication function in PEMS, but there is always at least one as 
required by the BSP model and its derivatives. 

This process of swapping and synchronisation is relatively straightforward to implement.
PEMS1 implements allocation using a basic "bump pointer" allocator,
which simply allocates memory in a contiguous range by appending new allocations
and "bumping" (increasing) the end pointer. Fig. 2.1 shows an example of such an
allocation. To swap, this area of memory is read from / written to disk in a single 
read / write operation, respectively. Whenever a virtual processor is executing, its 
context is swapped in to the same area of RAM. This ensures the address to a given 
memory location remains constant so pointers in the application remain valid. 
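
A bump-pointer allocator of this kind, together with swapping in a single operation, can be sketched as follows. The class and method names are hypothetical and the sketch works on offsets into a byte buffer rather than on real pointers; it is not the PEMS1 code.

```python
class BumpAllocator:
    """Toy bump-pointer allocator over one contiguous region."""

    def __init__(self, capacity: int):
        self.region = bytearray(capacity)  # stands in for the context in RAM
        self.end = 0                       # offset of the first free byte

    def malloc(self, size: int) -> int:
        """Return the offset of a new allocation of `size` bytes."""
        if size < 0 or self.end + size > len(self.region):
            raise MemoryError("context region exhausted")
        offset = self.end
        self.end += size                   # "bump" the end pointer
        return offset

    def swap_out(self) -> bytes:
        """Return the used part of the region as one contiguous image."""
        return bytes(self.region[:self.end])

    def swap_in(self, image: bytes) -> None:
        """Restore an image at the same offsets, so old offsets stay valid."""
        if len(image) > len(self.region):
            raise MemoryError("image larger than context region")
        self.region[:len(image)] = image
        self.end = len(image)
```

Because allocations are contiguous, the whole context is a single range, which is why one read or one write suffices to swap it. The price is that individual allocations cannot be freed and reused in this simple form.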



^Recall that all virtual processors run identical programs.

Figure 2.1: Memory Allocation in PEMS1



Implementing communication is more complex. Following the literature associated
with PEMS1 [15][16], the communication strategy used is described here using
the Alltoallv call as an example. Alltoallv is the most powerful collective communication
method: all others implemented in PEMS1 can be considered simple cases
of Alltoallv.



2.2 Alltoallv 



PEMS1 performs message delivery using a special disk area separate from the virtual
processor contexts, called the "indirect area". The indirect area is statically parti- 
tioned such that each virtual processor has a dedicated region of some fixed size for 
message delivery. Alltoallv is performed in two internal supersteps: messages are 
first written by the sender to the indirect area in a block-aligned and parallel fashion, 
then read from the indirect area by the receiver and delivered to the receiver's context 
on disk. 



Alg. 2.2.1 shows a straightforward implementation of this approach for a single
processor. 

Note the algorithm style used here differs slightly from that used in previous work 
on PEMS [15] and EM-BSP [13] [7] [6]. The style used here is more implementation 
directed, omitting lines such as "for i in 0 ... v - 1 do in parallel" which do not actually
occur in this type of multi-threaded code. When reading or analysing algorithms in
this style it is best to think from the perspective of a single thread executing the code.
For example, all algorithms in this thesis are written from a perspective such as "first,
deliver my messages"; not "first, each virtual processor delivers their messages". The
reader must keep in mind that several threads perform these actions simultaneously.

The notation $m_{i \to j}$ denotes the message sent from virtual processor i to virtual
processor j.

Algorithm 2.2.1: SIMPLE-Alltoallv-Seq
Data: S : Array of pointers to v messages to send
Data: R : Array of pointers to v messages to receive

    -- Send messages --
1   foreach message $m_{p \to i}$ in S do
2       Write $m_{p \to i}$ to i's indirect area on disk
3   Swap out
    -- Finished Internal Superstep 1 --
    -- Begin Internal Superstep 2 --
    -- Receive messages --
4   Swap in
5   foreach message $m_{i \to p}$ in R do
6       Read $m_{i \to p}$ from indirect area on disk to the i-th location in R
7   Swap out
    -- Finished Virtual Superstep --
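
The two internal supersteps of this delivery strategy can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the PEMS1 code: the function name is hypothetical, a nested list stands in for the indirect area on disk, and swapping and block alignment are left out, so the byte count covers message traffic only.

```python
def alltoallv_indirect(send):
    """send[p][i] is the message (bytes) from virtual processor p to i.

    Returns (recv, io_volume), where recv[p][i] is the message from i to p
    and io_volume counts message bytes written to and read from "disk".
    """
    v = len(send)
    # indirect[i][p] is the slot in i's indirect area reserved for sender p.
    indirect = [[None] * v for _ in range(v)]
    io_volume = 0

    # Internal superstep 1: every sender writes into the receivers' areas.
    for p in range(v):
        for i in range(v):
            indirect[i][p] = send[p][i]
            io_volume += len(send[p][i])

    # Internal superstep 2: every receiver reads its own indirect area.
    recv = [[None] * v for _ in range(v)]
    for p in range(v):
        for i in range(v):
            recv[p][i] = indirect[p][i]
            io_volume += len(recv[p][i])
    return recv, io_volume


# Hypothetical example: 3 virtual processors, 2-byte messages.
outgoing = [[bytes([p, i]) for i in range(3)] for p in range(3)]
incoming, volume = alltoallv_indirect(outgoing)
```

Every message is written once and read once, so with v virtual processors and messages of equal size the message traffic in this sketch is twice v squared times the message size.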



In the analysis of Alg. 2.2.1 the following variables are used:



$\omega$  An arbitrary bound on the simulated algorithm's message size (i.e. the size of
a message sent from one virtual processor to another).

$\mu$  The (maximum) size of a single virtual processor's context (i.e. the maximum
amount of memory allocated by any virtual processor).

B  The size of a disk block.



These variables remain free in the stated run times for various communication 
methods presented in this thesis. Their actual value depends on the characteristics of 
a particular application or system configuration. All three are assumed to have the 
same unit, such as bytes. 

To convert I/O volume to run time, the following coefficients are used: 
$G$  The time required to read / write a single block from / to disk for message
delivery

$S$  The time required to read / write a single block from / to disk for swapping

$G$ and $S$ are identical,¹ but different coefficients are used to keep terms related
to swapping and terms related to message delivery separate. Chapter 4 describes the
reasons for this in further detail.

The notation $[[\omega]]$ means "$\omega$ rounded up to the next multiple of $B$".

The terms "I/O volume" or "amount of I/O" are used to refer to an amount
of I/O in the same unit as $\mu$ and $\omega$, e.g. bytes. Specifically, it does not refer to
the number of I/O operations in blocks, often referred to as "I/Os" (note plurality) in
EM literature. I/O volume is stated separately because later sections in this thesis
investigate non-blocked I/O, and comparison of these approaches is best described
in terms of volume. This is done only to facilitate discussion and simplify analysis;
the total run time of algorithms is given in terms of I/O operations ("I/Os"), as is
typical in EM literature.

The notation "$I_a$" is used to refer to the amount of I/O performed by line $a$ of the
algorithm. The amount of I/O performed by a range of lines (inclusive) is denoted
"$I_{a \ldots b}$". The same notation used with $T$ rather than $I$ refers to the time taken,
rather than the amount of I/O. For example, line 4 in Alg. 2.2.1 refers to "Swap in",
therefore:

$I_4 = \mu$

Recall that all v virtual processors execute the same code. Thus, if a given line 
in the algorithm performs x I/O, in total the line is responsible for vx I/O, unless 
the line is conditionally executed by only some virtual processors. To avoid excessive 
repetition, phrases such as "for each virtual processor" are omitted from proof expla- 
nations where it is clear that all virtual processors perform the same actions (as is 
the case here). 



¹ Except with memory-mapped I/O, see §5.2.

The variables and notation used here are used consistently throughout this thesis,
and documented (with others) in Appendix B.



Lemma 2.2.1. Alg. 2.2.1 (PEMS1 single processor Alltoallv) performs $4v\mu + 2v^2\omega$
total I/O.



Proof. For each of the $v$ virtual processors:

The loop at line 1 first writes all $v$ outgoing messages, each of size $\omega$:

$I_{1 \ldots 2} = v^2\omega$

Line 3 swaps out the partition of size $\mu$:

$I_3 = v\mu$

Line 4 swaps in the partition of size $\mu$:

$I_4 = v\mu$

The loop at line 5 reads all $v$ incoming messages, each of size $\omega$:

$I_{5 \ldots 6} = v^2\omega$

Line 7 swaps out the partition of size $\mu$:

$I_7 = v\mu$

Finally, line 8 swaps in the partition of size $\mu$:

$I_8 = v\mu$

The total I/O performed by the algorithm, in the unit of $\mu$ and $\omega$ (e.g. bytes), is
therefore:

$I_{\text{simple-alltoall-seq}} = I_{1 \ldots 2} + I_3 + I_4 + I_{5 \ldots 6} + I_7 + I_8 = 4v\mu + 2v^2\omega$

□
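The accounting in this proof can be checked mechanically. The following Python sketch is illustrative only: the function name and the parameters `v`, `mu` and `omega` are chosen here and are not part of PEMS, and the final swap in is counted where the proof attributes it to line 8.

```python
def simple_alltoallv_io(v: int, mu: int, omega: int) -> int:
    """Tally the I/O volume of Alg. 2.2.1, in the unit of mu and omega."""
    total = 0
    for _ in range(v):          # every virtual processor runs the same code
        total += v * omega      # lines 1-2: write v outgoing messages
        total += mu             # line 3: swap out
        total += mu             # line 4: swap in
        total += v * omega      # lines 5-6: read v incoming messages
        total += mu             # line 7: swap out
        total += mu             # line 8: swap in, as counted in the proof
    return total


v, mu, omega = 8, 1 << 20, 4096
assert simple_alltoallv_io(v, mu, omega) == 4 * v * mu + 2 * v * v * omega
```

The tally agrees with the closed form for any choice of the three parameters, since each line contributes a fixed amount per virtual processor.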



Theorem 2.2.2. Alg. 2.2.1 (PEMS1 single processor Alltoallv) takes
$S\frac{4v\mu}{DB} + G\,2v^2\frac{[[\omega]]}{DB} + 2L$ time.



Proof. Follows directly from Lem. 2.2.1 since Alg. 2.2.1 performs no network com- 
munication and no significant computation. 

Since messages are delivered one at a time (i.e. each message delivery is a separate
I/O operation), message deliveries are each of size $[[\omega]]$ ($[[x]]$ meaning "$x$ rounded
up to the next multiple of $B$"). Thus, if messages are smaller than a single block
then overhead is accumulated for every message. However, this is not a performance
problem, since the time to transfer a single block is the minimum possible for any I/O
operation.

There are two internal superstep barriers, contributing 2L to the total run time. 

□ 
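The block rounding used in this proof is simple to state in code. The sketch below is an illustration under assumed values: the helper name `round_up_to_block` and the 4096-byte block size are chosen here and are not taken from PEMS.

```python
def round_up_to_block(x: int, block: int) -> int:
    """Return x rounded up to the next multiple of block."""
    return -(-x // block) * block   # ceiling division, then scale back up


B = 4096                            # assumed block size in bytes
assert round_up_to_block(1, B) == 4096      # a tiny message still costs one block
assert round_up_to_block(4096, B) == 4096   # exact multiples are unchanged
assert round_up_to_block(4097, B) == 8192   # one byte over costs a second block
```

With these values, a 1-byte message and a 4096-byte message transfer the same amount of data, which is the per-message overhead the proof refers to.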



Theorem 2.2.3. Alg. 2.2.1 (PEMS1 single processor Alltoallv) requires $v\mu + v^2\omega$
disk space.

Proof. Each virtual processor requires $\mu$ disk space for its context regardless of
message delivery. $v^2$ messages are delivered in total, each of size $\omega$, therefore an
additional $v^2\omega$ space is required for the indirect area. □



2.3 Potential for Improvement 

While experiments with PEMS1 have shown desirable scalability characteristics, the
system has significant overhead which requires the use of considerably more
computational resources to match the performance of comparable EM algorithms. Though
the ability to take advantage of several machines with parallel disks is a considerable
advantage, reducing this overhead will make the system more competitive on a wider
range of systems and problem sizes.

Additionally, the use of a separate area for message delivery introduces scalability
problems with large contexts (see §6.3) and makes tuning difficult in practice since
the user must know in advance the bounds on a given algorithm's communication
volume in order to allocate disk.

This thesis introduces several new strategies and capabilities for PEMS intended 
to address these issues. 

As is usually the case with EM algorithms, the most significant source of overhead 
in PEMS1 is unnecessary I/O. There are two cases where PEMS must perform I/O:
swapping and message delivery. 

2.3.1 Swapping 



Each internal superstep barrier in Alg. 2.2.1 implies a swap out and a subsequent
swap in of each virtual processor. However, many such swaps can be avoided by
making more extensive use of the "direct delivery to context" technique described
in the PEMS1 literature [15]. This technique is based on the observation that a
subsequent swap-in in the second internal superstep is not required, since messages
can be written directly to the context on disk. That is, instead of swapping in the
context, modifying it in memory, then swapping the context back out, the message
can simply be written directly to the appropriate location on disk.

Thus, if the second loop delivers messages directly this way, it is not necessary to
swap in at the first barrier. The final swap-out is also avoided because the context
on disk is already known to be consistent;² hence this avoids $2\mu$ I/O per virtual
processor.
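As a worked illustration of this saving, and assuming that only the swap terms counted in Lemma 2.2.1 change, the swap volume of one virtual superstep drops from four context transfers per virtual processor to two:

```latex
\begin{align*}
I_{\text{swap}}^{\text{before}} &= 4v\mu \\
I_{\text{swap}}^{\text{after}}  &= 4v\mu - 2v\mu = 2v\mu
\end{align*}
```

The names $I_{\text{swap}}^{\text{before}}$ and $I_{\text{swap}}^{\text{after}}$ are introduced here for illustration and are not notation used elsewhere in this thesis.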

Swapping with finer granularity can avoid slightly more unnecessary I/O: the
swap out at the first internal superstep barrier swaps out the entire context, but
this is not necessary. Virtual processors receive messages to some area within their
context, so when the Alltoallv call is completed and control is returned to user code
this region will have been overwritten with the received messages. Therefore, it is
not necessary to swap out this region (the "receive buffer") at the initial superstep
barrier.

² In fact a swap out can't occur here because the context is not swapped in, so a swap out would
write garbage data to disk.

Some swapping can also be avoided at superstep barriers: in a straightforward 
implementation, all threads swap at superstep barriers. However, for the last thread 
to execute in the superstep this is not necessary. The order of execution within a 
superstep is undefined, so it is wasteful to swap out this thread's context and allow a 
different thread to swap in and execute first in the next superstep. Instead, the last 
thread can simply remain swapped in through the barrier and be the first thread to 
run in the following superstep, thus avoiding one swap per superstep. More generally,
in the multi-core case threads execute in parallel rounds of $k$ threads at a time, so
this technique avoids $k$ swaps per superstep.

2.3.2 Message Delivery 



Alg. 2.2.1 writes all messages to be delivered to the indirect area on disk. There is
potential for improvement here based on two observations:

1. Each message that must be written in the first loop is a part of the sending 
virtual processor's context, and therefore will be written to disk regardless at 
the first barrier (when the sender's context is swapped out). Thus, the previous 
algorithm results in each message being written to disk twice. 

2. If the receiving virtual processor of a message is local and has already executed 
this superstep, then the final destination of the message is known and the mes- 
sage can be delivered directly to the destination context on disk. This avoids 
reading the message from disk again in order to deliver it to the receiver. 

There is an additional downside to delivering messages via a separate disk area: 
because the indirect area is large and separate from the area on disk where contexts 
are stored, delivery of messages (and subsequently swapping out) involves seeking 
across a very large area of disk. In the worst case this results in constantly seeking 
back and forth between the contexts area and the indirect area. This is potentially a 
serious performance issue, particularly for large $\mu$ or $v$. Since disk seeking is extremely
expensive, the performance impact of this behaviour could be as significant as the
actual amount of I/O performed, or even more so. The addition of multi-core support
compounds the problem due to several threads seeking simultaneously. Reducing or
eliminating this effect is therefore a promising path to improving the performance of
PEMS in practice.

Of course, indirect message delivery is not done without reason: the messages are
aligned and distributed among disks in a way designed to achieve fully parallel disk
I/O, and support "direct" I/O which requires all operations to be block aligned. §6.2
describes new methods of retaining these desirable characteristics without writing
messages to a separate area on disk.

2.3.3 Communication Balancing 

The original EM-BSP simulation algorithms (and PEMS1) require an upper bound
on communication volume so disk space can be allocated accordingly for the indirect
area. In the multi-processor case, this is achieved by using a deterministic routing
technique which first evenly distributes messages across the network before
completing the communication. Messages are first sent to an arbitrary intermediary
processor in a round-robin fashion, then sent to their final destination by that
intermediary. Because messages are evenly distributed among the intermediary
processors, this technique ensures balanced communication.
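The balancing effect of the first routing phase can be sketched in a few lines of Python. This is an illustration only: the function name and the policy of starting each sender's round-robin at its own index are assumptions made here, not the routing scheme of PEMS1 or of the EM-BSP literature.

```python
from collections import Counter


def route_via_intermediaries(messages, num_procs):
    """Assign each (source, destination) message an intermediary processor.

    Intermediaries are chosen round-robin per source (an assumed policy).
    Returns a list of (source, intermediary, destination) triples.
    """
    cursor = Counter()                     # per-source round-robin position
    routed = []
    for src, dst in messages:
        mid = (src + cursor[src]) % num_procs
        cursor[src] += 1
        routed.append((src, mid, dst))
    return routed


# Skewed pattern: each of 4 processors sends 4 messages, all to processor 0.
P = 4
msgs = [(src, 0) for src in range(P) for _ in range(P)]
load = Counter(mid for _, mid, _ in route_via_intermediaries(msgs, P))
assert all(load[p] == P for p in range(P))   # first phase is evenly spread
```

Even though every message has the same final destination, each intermediary receives the same number of messages in the first phase, which is what allows the communication volume per processor to be bounded in advance.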

This technique is straightforward and works well to ensure balanced communica- 
tion, but in the context of PEMS incurs a large amount of overhead. In order to be 
delivered, each message must be (in the worst case): 

1. Sent over the network (by the sender) 

2. Written to disk (by the intermediary) 

3. Read from disk (by the intermediary) 

4. Sent over the network (by the intermediary) 

5. Written to disk (by the receiver, to the indirect area) 

6. Read from disk (by the destination, from the indirect area) 

7. Written to disk (by the destination, to its context) 
The multiple reads and writes of each message to disk, in particular, are a significant
source of overhead due to the large cost of disk I/O.



2.3.4 Allocation 

The simple memory allocation scheme used by PEMS1 has a serious limitation for
many programs: freeing memory is not possible. Since only a pointer to the end of
all allocated memory is stored, there is no way to free a particular chunk of allocated
memory.

While some BSP-like algorithms allocate a large amount of memory initially then
use it throughout execution (such as the PSRS algorithm presented in §8.3), many
have more dynamic memory allocation requirements. PEMS1's basic allocator is not
appropriate for algorithms that continuously allocate and free chunks of memory, since
memory consumption will continue to increase until available space is exhausted.
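The limitation can be reproduced with a minimal "bump" allocator. The Python class below is a hypothetical model written for this illustration, not the PEMS1 allocator itself: it keeps only an end-of-allocation pointer, so its `free` has nothing to work with.

```python
class BumpAllocator:
    """Allocator that stores only a pointer to the end of allocated memory."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.end = 0                    # the only bookkeeping that is kept

    def alloc(self, size: int) -> int:
        if self.end + size > self.capacity:
            raise MemoryError("context exhausted")
        addr = self.end
        self.end += size
        return addr

    def free(self, addr: int) -> None:
        pass                            # no per-chunk metadata, nothing to reclaim


heap = BumpAllocator(capacity=1024)
for _ in range(4):
    chunk = heap.alloc(256)
    heap.free(chunk)                    # has no effect on consumption
assert heap.end == 1024                 # space is exhausted despite the frees
```

A program that only ever needs 256 units at a time still exhausts the 1024-unit context after four iterations, because freed chunks are never reused.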



2.3.5 Improvements 

Chapter 6 presents solutions to the shortcomings described in this section, all of
which are implemented in PEMS2. These improvements depend on two new fundamental
aspects of the design: multi-core support and several I/O drivers, presented
in Chapter 4 and Chapter 5 respectively.



Chapter 3 



Overview of PEMS2 



3.1 Software Design 



Fig. 3.1 shows an overview of the PEMS2 design.

The most significant change from the more static architecture of PEMS1 is the
addition of abstract interfaces for I/O and threading. All use of these subsystems
occurs through these relatively simple interfaces, which makes the addition of new
I/O and threading drivers to PEMS2 a straightforward process with little impact on
other components.

The original I/O (synchronous) and threading (user-space) implementations from
PEMS1 have been modified to fit within this framework. Both remain available for
use as user options in PEMS2.

Two new I/O drivers have been implemented. Asynchronous I/O, which allows
PEMS to submit many I/O requests to the disk at once and resume computation
or communication while they are performed, is described in §5.1. Memory-mapped
I/O, which allows PEMS to only swap in the required portions of a virtual processor
context at each superstep, is described in §5.2.

A new threading driver based on POSIX threads has been added which supports
true concurrency, the implications of which are discussed in Chapter 4.
[Figure 3.1: PEMS2 Design. Recoverable labels: Applications; PEMS API (public
interface to PEMS for applications; a subset of MPI for communication/synchronisation
and standard C for allocation).]
3.2 Computational Model



PEMS2 extends the EM-BSP [13] [7] [6] models shown in Fig. 1.2 with one or more
"cores" per real processor. The cores on a real processor access a single shared
main memory, and one or more disks. The cluster of real processors is assumed to be
homogeneous, i.e. each real processor has $k$ cores and $D$ disks. This extended model
is shown in Fig. 3.2.



Adding the ability for threads to execute concurrently is a relatively straightforward
modification to PEMS1 (replace the use of GNU Pth functions with POSIX
threads equivalents). The difficulty in adding multi-core support lies in the
implications, e.g. more sophisticated synchronisation and inter-thread communication
methods must be used, and the relevant portions of the system must be made thread-safe.
The details of how this has been accomplished are discussed in Chapter 4.



[Figure 3.2: PEMS2 Computational Model. Example shown: a real processor with
k = 4 cores per processor and D = 2 disks.]



Chapter 4 



Multi-Core Support 

PEMS1 supported only user-space threads via the GNU Pth library. On single-core
machines this can be advantageous because user-space threads avoid the overhead 
of context switching. However, achieving true concurrency on a multi-core machine 
requires the use of "real" system threads. 

While it is possible to run several MPI processes concurrently on a single multi- 
core machine, running a single process with a thread for each virtual processor avoids 
the overhead of inter-process communication, synchronisation, and context switching. 
Using threads also allows for more effective parallel disk I/O strategies since PEMS 
can control parallel access to disk(s) in more flexible ways. 

To achieve this, the threading system has been redesigned around a small set 
of simple synchronisation primitives which are safe for both user-space and kernel 
threads. PEMS2 can use either user-space threads via GNU Pth, or system threads 
via the POSIX Threads ("pthreads") API. In either case there is a 1 : 1 relationship 
between threads and virtual processors regardless of the number of cores available. 

The number of virtual processors that execute concurrently on a local real processor
is denoted $k$. The user may choose any value for $k$ provided $1 \le k \le \frac{v}{P}$.

4.1 Memory Partitions 

Thread concurrency in PEMS2 is achieved by allocating $k$ separate memory partitions
(rather than the single partition used by PEMS1). Thus, $k$ separate threads may be
swapped in at a given time and perform work concurrently. The user must ensure
that $k\mu$ real memory is available for these partitions.

A simple static mapping is used to assign threads to memory partitions: thread t 
uses partition t mod k. A dynamic mapping would be beneficial in many respects, 
but this would have the effect of changing the address of a given piece of virtual 
processor memory, thus invalidating pointers. For example, if a virtual processor is 
swapped in at memory address 10, a pointer to the first allocated piece of memory
would have the value 10. If that virtual processor was subsequently swapped in at 
memory address 20, the pointer should have value 20, but still has value 10, thus 
memory is corrupted. Because of this, a dynamic mapping of contexts on disk to 
memory partitions is not feasible within PEMS. 
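The static mapping and its effect on pointer validity can be sketched as follows. The function names and the contiguous partition layout are assumptions made for this illustration, not details taken from the PEMS2 source.

```python
def partition_of(thread_id: int, k: int) -> int:
    """Static mapping: thread t always uses partition t mod k."""
    return thread_id % k


def partition_base(thread_id: int, k: int, mu: int, region_start: int = 0) -> int:
    """Address at which a thread's context is swapped in (assumed contiguous layout)."""
    return region_start + partition_of(thread_id, k) * mu


k, mu = 4, 1 << 20
t = 6
pointer = partition_base(t, k, mu) + 128     # a pointer stored inside the context
# Thread 6 is swapped in at the same base address every time, so the stored
# pointer still refers to the same byte of its context after any swap.
assert pointer - partition_base(t, k, mu) == 128
assert [partition_of(i, k) for i in range(8)] == [0, 1, 2, 3, 0, 1, 2, 3]
```

Under a dynamic mapping the base address returned for thread 6 could differ between swap-ins, and the stored value of `pointer` would then refer to some other location.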



4.2 Controlling Concurrency 

With the addition of system thread support, several virtual processors may execute
in parallel on a single real processor, taking advantage of multiple cores. This raises
an issue when $\frac{v}{P} > k$ (which is generally the case for any reasonable configuration):
threads can run concurrently, thus more than $k$ threads may attempt to run
simultaneously. However, only $k$ memory partitions are available. PEMS itself can not
explicitly schedule threads $k$ at a time since the operating system scheduler is used.
Instead, an exclusive lock (mutex) is associated with each of the $k$ partitions in main
memory. A thread must obtain a lock on its memory partition before executing any
part of the simulated virtual processor's algorithm. Therefore, the number of virtual
processors which can run concurrently on a single real processor is at most $k$.
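The effect of per-partition locks on concurrency can be observed with a short Python experiment. This is a sketch using Python's `threading` module rather than the pthreads API used by PEMS2, and the function and variable names are invented for the example.

```python
import threading
import time


def run_virtual_processors(v: int, k: int) -> int:
    """Run v threads over k partition locks and return the peak concurrency seen."""
    partition_locks = [threading.Lock() for _ in range(k)]
    stats_lock = threading.Lock()
    running = 0
    peak = 0

    def virtual_processor(t: int) -> None:
        nonlocal running, peak
        with partition_locks[t % k]:       # must hold the partition to execute
            with stats_lock:
                running += 1
                peak = max(peak, running)
            time.sleep(0.001)              # stand-in for the simulated algorithm
            with stats_lock:
                running -= 1

    threads = [threading.Thread(target=virtual_processor, args=(t,)) for t in range(v)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return peak


assert 1 <= run_virtual_processors(v=16, k=4) <= 4
```

All 16 threads are started at once and left to the scheduler, yet no more than 4 are ever inside the locked region, because each must hold one of only 4 partition locks.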



4.3 Thread Synchronisation 

Superstep synchronisation, a simple barrier, is sufficient for collective communication
methods in which all processors participate as equals. However, there are many
methods which have more complex synchronisation requirements. A simple example
of such a method is the broadcast, or Bcast. In a Bcast, a single virtual processor
called the "root" sends a message to every other virtual processor (see §7.2). Thus,
other virtual processors have to wait specifically for the root to perform some action.
While full superstep barriers could be used for this purpose, synchronisation methods
specifically designed for such cases can achieve better performance. With a full
barrier, each virtual processor waits for every other virtual processor. However, in a
"rooted" case such as Bcast, any virtual processor that reaches the barrier after the
root need not wait at all. Since I/O is triggered by virtual processor execution (e.g.
I/O will occur when a virtual processor calls MPI_Bcast), this can be a significant
performance factor: the sooner a thread passes a barrier, the sooner it can submit further
I/O requests, resulting in higher throughput and more communication/computation
overlap.

The collective communication algorithms presented here require three styles of 
synchronisation (in addition to superstep barriers): 

1. Initial Synchronisation: Wait for the first thread 

2. Rooted Synchronisation: Wait for a specific "root" thread 

3. Final Synchronisation: Wait for all other threads 

These operations are implemented to work with any number of threads running 
at a time, swapping virtual processors in or out as required. 

Since each thread holds its memory partition lock while executing, and other
inactive threads require the same partition, simply using a primitive signal (e.g. that
provided by pthreads) would result in a deadlock and/or missed signals. This is
because primitive signals are not persistent, i.e. only those threads waiting on a
signal at the moment it fires are notified. In PEMS2, a primitive signal with an
associated counter and flag make up a composite synchronisation structure. This
allows for synchronisation both between threads which are currently swapped in (via
the primitive signal) and threads which are not (via the counter or flag).

The primitive signal is only used to synchronise the k currently swapped in threads, 
eliminating the possibility of deadlock. The counter keeps track of how many threads 
have reached the synchronisation barrier, and the flag is used to signal an arbitrary 
condition (e.g. "the root has finished").

This composite signal structure is simply referred to as a "signal", which is the
main threading abstraction used to implement our synchronisation primitives.
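A reduced model of such a composite signal can be written with Python's `threading.Condition`. The class below is a sketch for illustration only: its name, fields and methods are assumptions, and it omits the memory partition handling that the PEMS2 primitives perform.

```python
import threading


class Signal:
    """Condition variable plus a persistent flag and counter (illustrative)."""

    def __init__(self) -> None:
        self.cond = threading.Condition()
        self.flag = False
        self.count = 0

    def signal(self) -> None:
        with self.cond:
            self.flag = True            # remembered for threads that run later
            self.cond.notify_all()      # wakes only threads waiting right now

    def wait(self) -> int:
        with self.cond:
            while not self.flag:        # a late arrival sees the flag and skips waiting
                self.cond.wait()
            self.count += 1             # track how many threads have passed
            return self.count


sig = Signal()
sig.signal()                            # fires before any thread is waiting
results = []
late = threading.Thread(target=lambda: results.append(sig.wait()))
late.start()
late.join(timeout=5)
assert results == [1]                   # the late thread was not blocked
```

With a bare condition variable the late thread would wait for a notification that has already been delivered; the persistent flag is what lets it proceed.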

All functions described in this section are called while the thread holds the lock
on its memory partition. Because swapping is generally the most expensive operation
performed by PEMS during a simulation, the goal of these primitives is to swap
only when necessary. Run times stated for these methods only consider time spent
performing I/O, since no significant computation takes place.






4.3.1 Rooted Synchronisation 



Alg. 4.3.1 (EM-Wait-For-Root) waits for the root thread to signal. This is generally
necessary for any rooted collective communication method (e.g. Bcast, Gather).
Swapping to disk occurs only when a thread is blocking the memory partition required
by the root. The return value indicates whether the partition has been swapped out,
which allows the caller to only swap in/out again if necessary. Only the non-root
threads call this function; the root thread must perform whatever work is required,
then signal (using Alg. 4.3.5, EM-Signal-Threads) to unblock the other threads.



Algorithm 4.3.1: EM-Wait-For-Root

Data: s (signal), t (this thread ID), r (root thread ID) | t ≠ r
Result: True iff thread was swapped out

1   result ← false
2   s.lock()
3   if s.flag = false then              // If the root has not already signalled
4       $p_t$ ← t mod k                 // Current thread's partition
5       $p_r$ ← r mod k                 // Thread r's partition
6       if $p_t$ = $p_r$ then           // If t and r share a partition
            // Yield to root
7           result ← true
8           Swap out
9           Unlock partition
10      s.wait()                        // Wait for root to signal
11      if $p_t$ = $p_r$ then           // If t and r share a partition
            // Yielded above, so re-lock partition
12          s.unlock()                  // Release signal lock to prevent deadlock
13          Lock partition
14          s.lock()
15  s.count ← s.count + 1
16  if s.count = $\frac{v}{P}$ then     // If all non-root threads are finished waiting
        // Reset signal
17      s.count ← 0
18      s.flag ← false
19  s.unlock()
20  return result



Lemma 4.3.1. Alg. 4.3.1 takes $S\frac{v\mu}{PkB}$ time in the worst case.
Proof. The only possible I/O occurs at line 8, which is only executed by virtual
processors which share a memory partition with the root processor. There are $k$
partitions per real processor, shared by $\frac{v}{P}$ virtual processors, thus $\frac{v}{Pk}$ virtual processors
may perform I/O. If a virtual processor performs I/O, it swaps out once at line 8,
resulting in $\mu$ I/O per virtual processor that shares a partition with the root. Since
all virtual processors that perform I/O share a memory partition, only one may be
swapped in at a given time, therefore no disk parallelism occurs in the worst case
when striping is not in use. □



Note that Lemma 4.3.1 does not take disk striping into consideration, i.e. it is
assumed that each virtual processor is mapped to a single disk. If PEMS is being
used on a configuration where all data is striped across all disks, then all I/O is
inherently fully parallel, and therefore Alg. 4.3.1 would take $S\frac{v\mu}{PkDB}$ time.



4.3.2 Initial Synchronisation 

Implementations of several collective communication functions require an arbitrary
single thread to do some work (e.g. perform MPI communication) before any other
threads continue. Alg. 4.3.2 (EM-First-Thread), when called by all threads, will
return true immediately if the caller is the first thread, or otherwise block until the
first thread has signalled and return false. Note that when true is returned the signal
is still locked; this allows the first thread to perform the necessary work while other
threads wait. The first thread must signal (using Alg. 4.3.5 with false as the "lock"
parameter) when it has completed the work in order to wake any waiting threads.



Lemma 4.3.2. Alg. 4.3.2 performs no I/O.



4.3.3 Final Synchronisation 

Collective communication calls which collect data at a single root processor (e.g.
Gather) must wait for other threads to finish their work before the results can be
gathered and delivered to their final destination. Alg. 4.3.3 (EM-All-Threads-Finished)
along with Alg. 4.3.4 (EM-Wait-Threads) provides the required mechanism.
If true is returned, all threads have reached the call and the work may be
safely performed. Whether or not a swap has occurred is passed as an input/output
parameter (e.g. a pointer) to allow cascading several calls without performing
unnecessary swaps: if true is passed for this parameter, no swap will be performed.
Otherwise, if a swap is performed, the parameter will be set to true to notify the
caller.

Algorithm 4.3.2: EM-First-Thread

Data: t (this thread ID), s (signal)
Result: True iff caller is the first thread

1   s.lock()
2   if s.count = 0 then                 // If this is the first thread
        // Keep lock and return true
3       s.flag ← false
4       return true
5   s.count ← (s.count + 1) mod $\frac{v}{P}$
6   if s.flag = false then              // If first thread has not finished
7       s.wait()
8   if s.count = 0 then                 // If this is the last thread
        // Reset signal
9       s.flag ← false
10  s.unlock()
11  return false



Like Alg. 4.3.2 (EM-First-Thread), if false is returned the lock is not released.
When this happens the caller must call Alg. 4.3.4 (EM-Wait-Threads) which will
block until all threads have completed.



Lemma 4.3.3. Alg. 4.3.4 performs at most $v\mu$ I/O.

Proof. The only I/O performed is a swap out of size $\mu$, which is called $v$ times in the
worst case (once by each virtual processor). □



4.3.4 Signalling 

Both Initial and Rooted synchronisation require a thread to signal the others once
some work has been performed. Alg. 4.3.5 (EM-Signal-Threads) is used for this
purpose in both cases. Since these cases have different locking semantics, whether
the signal lock should be taken is passed as a parameter (specifically: false must be
passed in the Initial case, and true in the Rooted case).






Algorithm 4.3.3: EM-All-Threads-Finished

Data: t (this thread ID), s (signal), w (whether swap has occurred)
Result: True iff the caller is last

1   result ← true
2   last ← false
3   s.lock()
4   if s.count = $\frac{v}{P}$ − 1 then      // If this is the last thread
        // Signal others, reset signal and return true
5       s.count ← 0
6       s.broadcast()
7       s.unlock()
8       return true
9   else                                // This is not the last thread
10      s.count ← (s.count + 1) mod $\frac{v}{P}$
11      if s.flag ≠ true then           // If the last thread has not already finished
12          if w = false then           // If this thread hasn't already swapped out
                // Swap out and notify caller
13              Swap out
14              w ← true
            // Wait for last thread to finish
15          Unlock partition
16          s.wait()
17          Lock partition
18      if s.flag = true then           // Last thread has finished
19          s.unlock()
20      else                            // This thread is blocking the last thread
            // Keep lock and return false
21          return false
22
23  return result
Algorithm 4.3.4: EM-Wait-Threads

Data: s (signal), w (whether swap has occurred)

1   if w = false then                   // If this thread hasn't been swapped out yet
2       Swap out
3       w ← true
    // Yield partition and wait for signal
4   Unlock partition
5   s.wait()
6   Lock partition
    // Reset signal
7   s.flag = false
8   s.count = 0
9   s.unlock()



Algorithm 4.3.5: EM-Signal-Threads
Data: t (this thread ID), s (signal), l (whether to lock)

1   if l = true then
2       s.lock()
3   s.count ← (s.count + 1) mod $\frac{v}{P}$
4   s.flag ← true                       // Set flag for threads yet to run
5   s.broadcast()                       // Signal the k − 1 other currently running threads
6   s.unlock()



Chapter 5 



New I/O Drivers 



5.1 Asynchronous I/O 
5.1.1 Background 

The UNIX system I/O used by PEMS1 is synchronous, i.e. a call to read or write
blocks until the I/O operation has finished. In some cases this is necessary because 
execution can not continue until I/O is finished, typically because the buffers used are 
required for the next operation. In other cases, however, there is useful work that can 
be safely performed in parallel with the I/O operation. In these cases, asynchronous 
I/O is advantageous. Asynchronous I/O allows an I/O request to be submitted with 
a non-blocking call, and provides a separate mechanism to wait for completion. This 
allows I/O to proceed in parallel with computation, improving overall performance. 

An additional benefit of asynchronous I/O is the ability to send many I/O requests 
to the operating system (OS) at once. With synchronous I/O, this is not possible 
because all I/O requests block. Asynchronous I/O, however, allows submitting many 
requests in rapid succession, keeping the OS and disk busy with I/O requests. This 
is beneficial because the OS attempts to schedule disk I/O optimally when several 
requests are pending. Several algorithms exist for this purpose which yield better 
performance than a trivial First Come First Served (FCFS) algorithm [21]. All modern
commonly used operating systems include at least one disk scheduling algorithm;
Linux in particular provides several which may be selected at runtime for a specific
disk volume (see §9.1).



5.1.2 Design 

PEMS2 uses the STXXL [10] file layer for asynchronous I/O. This is the lowest level 
abstraction in STXXL, essentially a portability layer for asynchronous I/O with a 
more elegant interface than the operating system's API. The scheduling and caching 
mechanisms in other layers of STXXL are not used (see §1.3.1 for additional discussion
of STXXL).

The non-trivial modifications to PEMS required for asynchronous I/O are con- 
cerned with waiting for the necessary I/O requests to finish. All I/O in PEMS is 
performed by some virtual processor, and written to / read from the context of an- 
other virtual processor. It is important that threads only wait when necessary to 
avoid blocking other threads which could otherwise proceed. Generally, the thread 
that initiated the I/O request is the only thread that should wait. Accordingly, 
PEMS2 has k independent I/O request queues per real processor, one for each lo- 
cal virtual processor that is swapped in. Each virtual processor can make multiple 
I/O requests (e.g. during message delivery) and explicitly wait for all, or some, of its 
own requests to finish if necessary. Otherwise, all requests are waited on at the next 
superstep barrier before the virtual processor is swapped out. 
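The idea of a per-virtual-processor request queue can be sketched with Python's standard library. This is not the STXXL file layer used by PEMS2: a thread pool stands in for asynchronous I/O, and the class and function names are invented for the example (`os.pwrite` is available on Unix systems).

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor, wait


class RequestQueue:
    """Pending I/O requests belonging to one swapped-in virtual processor."""

    def __init__(self, executor: ThreadPoolExecutor) -> None:
        self.executor = executor
        self.pending = []

    def submit_write(self, fd: int, data: bytes, offset: int):
        # Non-blocking from the caller's view: the write runs on a pool thread.
        future = self.executor.submit(os.pwrite, fd, data, offset)
        self.pending.append(future)
        return future

    def wait_all(self) -> int:
        """Block until this queue's own requests finish; return bytes written."""
        wait(self.pending)
        written = sum(f.result() for f in self.pending)
        self.pending.clear()
        return written


def demo() -> int:
    with tempfile.TemporaryFile() as f, ThreadPoolExecutor(max_workers=4) as pool:
        queue = RequestQueue(pool)
        for i in range(8):              # e.g. eight message deliveries
            queue.submit_write(f.fileno(), bytes([i]) * 512, i * 512)
        return queue.wait_all()         # e.g. at the next superstep barrier


assert demo() == 8 * 512
```

Because each queue tracks only its own futures, waiting on one queue does not block a thread that owns a different queue.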



5.2 Memory Mapped I/O 
5.2.1 Background 

The I/O approaches previously discussed (both synchronous and asynchronous) have 
a major disadvantage for certain algorithms: at each virtual superstep, the entire 
context of every virtual processor is swapped regardless of how much data the algo- 
rithm actually uses. In cases where the algorithm only accesses a small portion of 
the data (e.g. sampling) this can result in a very large amount of unnecessary I/O. 
This can cause the I/O complexity of the simulation to be far from optimal, particularly
for algorithms with many supersteps, each of which does not access the majority
of memory. This problem can not be solved with explicit I/O (i.e. read/write calls)
because PEMS has no way of knowing which areas of memory are actually used by 
the simulated algorithm. 

Special API calls could be added to PEMS to address this problem, but this 
conflicts with the goal of simulating generic BSP-like algorithms, and would not be 
compatible with MPI. Fortunately, there is a mechanism available in all modern 
operating systems which can solve this problem: memory mapped I/O. Memory 
mapping is a facility which allows a file (or other addressable resource) to be mapped 
onto a range of virtual memory and used normally like any other region of memory,
without actually reading the entire file into physical memory. Pages are swapped 
to/from disk by the OS as necessary without any effort on behalf of the programmer. 

The critical property of memory mapped I/O is that this page swapping is per- 
formed by the OS kernel which, unlike "userland" code such as PEMS, does know 
which areas of memory are accessed. This allows PEMS to avoid unnecessary swap- 
ping, since the kernel will only swap in/out those regions of memory which are actually 
used by the simulated algorithm. This implies the cache behavior of the algorithm 
may also affect I/O performance - algorithms with favourable memory access pat- 
terns will make use of the kernel-managed cache more effectively, and achieve better 
performance with memory mapped I/O. 

Experiments in §8.4 confirm that memory mapping avoids a significant amount of I/O
in some cases.

5.2.2 Design 

When used with memory mapping, PEMS2 simply maps the entire used portion
of disk into memory. Rather than allocate in-memory partitions and swap in/out
from/to disk, the simulated algorithm works directly with a range of this mapped
memory. All other aspects of the simulation remain the same; in particular, only k
virtual processors execute at a given time. If suitable parameters are chosen such
that kμ fits within physical memory (as it must with explicit I/O), this ensures that
the amount of virtual memory used at any given time fits within physical memory,
so thrashing is avoided.

Because memory-mapped disk regions are used in the same way as any other region
of memory, message delivery in PEMS2 with memory-mapped I/O is simply a direct
virtual memory copy (e.g. using memcpy). In degenerate cases where the problem
size is smaller than the available physical memory, this effectively makes PEMS an
in-memory multi-core MPI system. This allows PEMS to scale gracefully over a wide
range of problem sizes from very small, to the majority of physical memory, to much
larger than physical memory.
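
The idea of delivering a message as a plain memory copy inside a mapped file can be illustrated with a minimal sketch. It is not the PEMS2 implementation: the backing-file layout (contexts stored back to back), the helper names, and the use of Python's `mmap` module are all assumptions made for illustration.

```python
import mmap
import os
import tempfile


def make_backing_file(num_contexts, context_size):
    """Create a zero-filled file holding num_contexts contexts back to back
    (hypothetical layout, chosen only for this sketch)."""
    fd, path = tempfile.mkstemp()
    os.ftruncate(fd, num_contexts * context_size)
    return fd, path


def deliver(mapped, context_size, src_vp, src_off, dst_vp, dst_off, length):
    """Copy a message between two contexts inside the mapping.

    No explicit read/write call is issued; the OS pages data in and out
    as the touched pages require.
    """
    src = src_vp * context_size + src_off
    dst = dst_vp * context_size + dst_off
    mapped.move(dst, src, length)  # memmove-style copy within the mapping


def demo():
    context_size = 4 * mmap.PAGESIZE
    fd, path = make_backing_file(3, context_size)
    try:
        with mmap.mmap(fd, 0) as mapped:      # map the whole file
            mapped[0:5] = b"hello"            # virtual processor 0 prepares a message
            deliver(mapped, context_size, 0, 0, 2, 16, 5)
            start = 2 * context_size + 16
            return bytes(mapped[start:start + 5])
    finally:
        os.close(fd)
        os.remove(path)
```

Only the pages containing the message are touched here, which is the property that lets the kernel skip I/O for unused parts of each context.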



Chapter 6 



Simulation Enhancements 



6.1 Swapping 

A straightforward implementation of many communication algorithms could perform 
many complete swaps in a virtual superstep (i.e. a superstep in the simulated algo- 
rithm), since virtual supersteps may be composed of several internal supersteps (i.e. a 
superstep performed by PEMS). A careful implementation, however, can ensure that 
each virtual processor is completely swapped out and completely swapped in only 
once per virtual superstep. Thus, for explicit I/O, L > 

With the use of memory mapped I/O, supersteps cause no explicit I/O at all.
In this case the analysis of a simulated algorithm must take into consideration any
swapping I/O it would cause by accessing its own memory mapped partition. Because
of this, generic bounds for a PEMS simulation using memory mapped I/O cannot be
given; the analysis is specific to a particular algorithm.

6.2 Message Delivery 

This section introduces a new communication strategy for PEMS which addresses
the limitations discussed in §2.3. For illustrative purposes, the basic concept is first
presented in the form of a simplified algorithm, Alg. 6.2.1, which does not consider
details such as block alignment.

Message delivery in PEMS2 is discussed using Alltoallv as an example. Rather
than write/read messages to a separate area on disk as in PEMS1, all virtual processors
record in a table where in their contexts they expect to receive incoming
messages. Then, they deliver directly to other virtual processors' contexts on disk.
Thus the additional communication area on disk (and with it a significant amount of
I/O and disk seeking) is eliminated.



Note that L in this thesis differs from previous work on PEMS; see Appendix B.4.
Algorithm 6.2.1: Simple-Direct-Alltoallv
Data: S : Array of pointers to v messages to send
Data: R : Array of pointers to v messages to receive

1 Let T be a shared v × v table of incoming message offsets

-- Store message offsets --

2 Store incoming message offsets from R in T
3 Swap out

-- Finished Internal Superstep 1 --

-- Begin Internal Superstep 2 --

-- Deliver messages --

4 Swap in
5 foreach message m_{p→i} in S do
6     Write m_{p→i} to T_{p→i}
7 Swap out

-- Finished Virtual Superstep --
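
The two internal supersteps of the simplified algorithm can be mimicked in a few lines. This is a sketch under stated assumptions, not PEMS2 code: each on-disk context is modelled as a `bytearray`, virtual processors run one after another, and the function and parameter names are hypothetical.

```python
def simple_direct_alltoallv(send, recv_offsets, context_size):
    """Simulate direct message delivery for v virtual processors.

    send[p][i]         -- bytes that virtual processor p sends to i
    recv_offsets[i][p] -- offset in i's context where i expects p's message
    Returns the list of contexts "on disk" after the virtual superstep.
    """
    v = len(send)
    disk = [bytearray(context_size) for _ in range(v)]

    # Internal superstep 1: every receiver records its offsets in the
    # shared table T (indexed here as T[sender][receiver]).
    T = [[None] * v for _ in range(v)]
    for i in range(v):
        for p in range(v):
            T[p][i] = recv_offsets[i][p]

    # Internal superstep 2: every sender writes each message straight
    # into the destination context; no separate message area is used.
    for p in range(v):
        for i in range(v):
            msg = send[p][i]
            off = T[p][i]
            disk[i][off:off + len(msg)] = msg
    return disk
```

For example, with three virtual processors each sending the two bytes `(p, i)` to processor `i`, which expects them at offset `2 * p`, context `i` ends up holding `0, i, 1, i, 2, i`.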



This strategy avoids out-of-place message delivery, but is not an ideal solution
for two reasons: I/O operations are not necessarily block aligned, and messages are
written to disk and read again in cases where this can be avoided.

One possible approach to eliminate alignment issues is to simply use buffered
I/O. However, the caching and copying inherent to buffered I/O is not suitable for
a system like PEMS which consumes as much main memory as possible. PEMS1
resolved this by organizing all message data in an appropriate way in a separate area
on disk. PEMS2 instead directly delivers the largest aligned portion of a message
possible, and keeps a cache of remaining blocks which require "cleaning up". The key
observation is that for a given message, a maximum of 2 blocks may not be properly
aligned (namely the first and last block of the message). Since each virtual processor
receives v messages, a given virtual processor must receive at most 2v unaligned
blocks - dramatically less than the total message volume for a typical coarse-grained
algorithm.

Note that while messages may be distributed in any way across the context, an individual
message is a contiguous range.

The overhead inherent in buffered I/O is due to the fact that a portion of a block
cannot physically be written to disk. For example, if the first half of a block must
be written, the contents of the second half must be in memory so the complete block
can be assembled and written to disk. Therefore, some sort of cache is required to
avoid corrupting blocks when a partial block is written. We will need to emulate this
behavior to achieve our goal, but are able to do so more efficiently than the generic
cache mechanism in the kernel since we know precisely what the kernel must guess.

The general solution to this problem is trivial: simply read in the desired block
from the destination context, modify it, and write it back out again. Unfortunately
this is not sufficient for our purposes since two (or more, in cases with very small
messages) messages can overlap a single block, which raises synchronisation issues
when k > 1. The overhead associated with a read/write cycle is also undesirable.
Instead, we will cache the blocks containing unaligned message ends ("boundary
blocks") in memory throughout the course of the Alltoallv call. As virtual processors
deliver the bulk of their messages directly to their destinations, they update this cache
with the remaining fragments of the delivered messages. Since this is done when the
relevant contexts are already swapped in, the read/write cycle is avoided. Finally,
when the bulk of all messages have been delivered each processor flushes the necessary
boundary blocks from the cache in memory to its context on disk and the algorithm
completes.
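
The boundary block idea can be made concrete with a small sketch. Everything here is an illustrative assumption rather than the PEMS2 code: the `Disk` class (which accepts only whole, aligned blocks), the dictionary used as the cache, and the simplification that cached blocks are initialised from the receiver's in-memory context. Alignment of the source buffer is ignored.

```python
def split_message(offset, length, B):
    """Split [offset, offset+length) into (head, middle, tail) byte ranges.

    middle is the largest block-aligned part; head and tail are the
    unaligned fragments, each lying within a single block.
    """
    end = offset + length
    mid_start = -(-offset // B) * B   # offset rounded up to a block boundary
    mid_end = (end // B) * B          # end rounded down to a block boundary
    if mid_start > mid_end:           # message lies strictly inside one block
        return (offset, end), (end, end), (end, end)
    return (offset, mid_start), (mid_start, mid_end), (mid_end, end)


class Disk:
    """A destination context that only accepts whole, aligned blocks."""

    def __init__(self, size, B):
        assert size % B == 0
        self.B = B
        self.data = bytearray(size)

    def write_blocks(self, offset, payload):
        assert offset % self.B == 0 and len(payload) % self.B == 0
        self.data[offset:offset + len(payload)] = payload


def deliver(disk, cache, receiver_memory, offset, msg):
    """Write the aligned middle directly; keep fragments in the cache."""
    B = disk.B
    head, mid, tail = split_message(offset, len(msg), B)
    if mid[1] > mid[0]:
        disk.write_blocks(mid[0], msg[mid[0] - offset:mid[1] - offset])
    for start, end in (head, tail):
        if end > start:
            blk = start // B
            if blk not in cache:      # first fragment touching this block
                cache[blk] = bytearray(receiver_memory[blk * B:(blk + 1) * B])
            cache[blk][start - blk * B:end - blk * B] = msg[start - offset:end - offset]


def flush(disk, cache):
    """Write every cached boundary block to the context as a whole block."""
    for blk in sorted(cache):
        disk.write_blocks(blk * disk.B, bytes(cache[blk]))
    cache.clear()


def demo():
    B = 8
    disk, cache, memory = Disk(32, B), {}, bytes(32)
    msg1, msg2 = bytes(range(1, 11)), bytes(range(101, 116))
    deliver(disk, cache, memory, 3, msg1)    # bytes 3..12, no full block
    deliver(disk, cache, memory, 13, msg2)   # bytes 13..27, one full block
    cached = sorted(cache)
    flush(disk, cache)
    return bytes(disk.data), cached, msg1, msg2
```

In `demo`, the two messages share block 1, which is the overlapping case described above: both fragments are merged in the cache and the block is written once at flush time.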

Another issue arises with the direct delivery of messages in the absence of buffered
I/O: while the source and destination of each message contain aligned regions of equal
size, these regions may not have equivalent alignment (i.e. their start offsets are not
equivalent mod B). Fig. 6.1 illustrates such a case (the top and bottom regions represent
the sender's and receiver's contexts, respectively). The largest aligned region
within the message source does not correspond directly to the largest block-aligned region
within the message destination because their alignment differs. This is a problem
because non-buffered I/O requires that all offsets be block aligned, both in memory
and on disk. To resolve this we take advantage of the fact that the source context is
both in memory and on disk at write time, so we can destroy the context in memory
and avoid swapping out to prevent corruption. We shift each message in memory leftward
so the regions in the sender and receiver align properly (Step 1 in Fig. 6.1). Thus
aligned, these regions can be delivered directly, and the remainder of the message is
handled by the boundary block cache in memory.

"There are only two hard problems in Computer Science: cache invalidation and naming things."
Phil Karlton

Figure 6.1: Direct Message Delivery



6.3 Disk Space Reduction 

In addition to reducing the volume of I/O performed, the elimination of the indirect 
area significantly reduces the amount of disk space required to run a given simulation, 
particularly with large numbers of virtual processors. This allows a given system 
configuration to handle larger problem sizes. 

In PEMS1, each real processor required vμ/P disk space for its local virtual processor
contexts, and vμ disk space for the indirect area. Note that the size of the indirect area
increases with v rather than v/P. This has the effect of increasing disk space when real
processors are added even if v/P remains constant, which can be a significant scalability
problem. The ideal strategy for scaling up a PEMS simulation is to determine the
parameters that fully utilize the resources available to a single machine, then be able
to easily add real processors as necessary to reach the desired problem size. An area
on disk that scales with v rather than v/P conflicts with this concept. In practice this
makes scaling more tedious than necessary, since predicting the amount of required
disk space is more difficult. Additionally, as experiments in §8.3.3 show, this large
region of disk space can incur a serious performance penalty when μ is large, due to
both disk seek time and file system overhead.

During the course of a simulation the disk is continually reading and writing for
swapping and message delivery. As a result the disk head must constantly seek across
the entire region of disk space, including, in PEMS1, the huge indirect area. This
occasionally had the counter-intuitive effect of making the simulation slower when
more RAM was added and μ correspondingly increased, because the disk seek time
dwarfed the time spent actually swapping a given context.

To solve this problem, the improved simulation algorithms introduced in this
thesis eliminate the indirect disk area entirely, so the amount of disk space required
per real processor is precisely vμ/P. As a result, real processors can be added to increase
problem size without increasing the disk space requirement for each real processor.
In practice, this makes tuning a PEMS simulation much more manageable.
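
The scaling difference can be checked with a short calculation. The sketch below assumes the reading of this section in which each real processor stores vμ/P for its local contexts and, in PEMS1 only, a further vμ for the indirect area; the function name and the unit (one unit per context) are hypothetical.

```python
def disk_per_real_processor(v, P, mu, indirect_area):
    """Disk space needed by one real processor.

    v  -- total number of virtual processors
    P  -- number of real processors (assumed to divide v)
    mu -- context size of one virtual processor
    indirect_area -- True for the PEMS1 layout, False for PEMS2
    """
    contexts = (v // P) * mu
    return contexts + (v * mu if indirect_area else 0)


def scaling_table(local_vps, mu, processor_counts):
    """Keep v/P fixed and add real processors; return rows of
    (P, PEMS1 space, PEMS2 space)."""
    rows = []
    for P in processor_counts:
        v = local_vps * P
        rows.append((P,
                     disk_per_real_processor(v, P, mu, True),
                     disk_per_real_processor(v, P, mu, False)))
    return rows
```

With four local virtual processors of size 1 per real processor, `scaling_table(4, 1, [1, 4, 16])` shows the PEMS1 figure growing with P while the PEMS2 figure stays at 4.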



Fig. 6.2 illustrates the difference in disk space consumption between the two strategies.
Even for a modest v/P the disk space requirements for PEMS1 rapidly increase;
in this case with 16 processors the disk space required of a single real processor exceeds
the total problem size. PEMS2, in contrast, only uses disk for virtual processor
contexts, so the amount of disk space required precisely matches the problem size
regardless of how many real processors are added.

Figure 6.2: Disk Space Requirements



6.4 Communication Buffer Size 



Due to the removal of the indirect message area, PEMS2 does not require an upper
bound on communication volume in order to allocate disk space, but communication
volume must still be bounded in many cases to avoid exceeding the available communication
buffer. Note that one bound on communication volume is inherent: each
virtual processor can send at most μ data in total, since each virtual processor has μ
memory and messages must reside in that memory before being sent.

Each virtual processor sends at most v messages in a communication superstep.
The user may configure how many of these messages are sent at once using the
parameter α. By choosing an appropriate value for α, the user may ensure there is
always sufficient buffer space to handle communication. This strategy removes the
need for indirect routing as in PEMS1 (see §2.3.3). Performance is therefore improved
since each message is sent over the network precisely once, to its destination real
processor.

The amount of I/O performed by communication methods depends on the size of
messages. To represent this in analytical results, the variable ω is used to represent
an arbitrary bound on virtual message size. Specific values may be substituted for ω
to find the run time for a particular call, or a particular computational model. For
example, if a Bcast is performed where the message is simply a single 32-bit integer,
ω = 4 bytes; for a CGM algorithm, ω = O(^); etc.

6.5 Scheduling and Disk Parallelism 

If each virtual processor is mapped to a single disk (i.e. striping or similar techniques
are not in use), the runtime of a collective communication method depends on the
order of execution of virtual processors. This is because virtual processors execute in
synchronised rounds k at a time, where each round includes a single virtual processor
mapped to each memory partition (0...k-1). However, this does not automatically
imply that each round contains a virtual processor mapped to each disk. In the worst
case, only a single disk may be used despite several disks being available. Fig. 6.3
shows such a case: if the virtual processors shown in bold (0, 4, and 8) are executed in
a round, only disk 0 is used for that round and thus disk parallelism is not exploited.

This problem must be addressed in order to precisely analyse the communication
functions in PEMS2 and applications built with them. Restrictions on k and D could
solve the problem, but this approach is not realistic since k and D reflect physical
system characteristics. Defining the scheduler's behaviour such that these situations
are avoided is more flexible, and feasible to implement in practice. Conveniently,
a trivial scheduling algorithm results in the desired behaviour: if virtual processors
are executed in ID order, then message delivery is distributed across all disks. For
example, with k = 3 as in Fig. 6.3, processors 0, 1, 2 would execute in the first round,
3, 4, 5 in the next round, etc. If k ≥ D, then clearly each round uses D disks in
parallel (since an increasing sequence of k consecutive integers mod D contains all
integers in 0...D-1 if k ≥ D). If k < D then this is not the case, and virtual processor
contexts should be distributed across disks to exploit disk parallelism.

Figure 6.3: Memory Partition and Disk Mapping (k = 3, D = 2)
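
The effect of scheduling order on disk usage can be sketched as follows. The mapping of virtual processor ID to disk (ID mod D) is an assumption inferred from the example with processors 0, 4, and 8; the function names are hypothetical.

```python
def rounds_in_id_order(v, k):
    """Group virtual processors 0..v-1 into rounds of k in increasing ID order."""
    return [list(range(start, min(start + k, v))) for start in range(0, v, k)]


def disks_used(round_vps, D):
    """Disks touched by one round, assuming virtual processor i lives on disk i mod D."""
    return sorted({vp % D for vp in round_vps})


def all_rounds_fully_parallel(v, k, D):
    """True if every ID-ordered round touches all D disks."""
    return all(disks_used(r, D) == list(range(D)) for r in rounds_in_id_order(v, k))
```

With k = 3 and D = 2, the round (0, 4, 8) touches only disk 0, whereas every ID-ordered round such as (0, 1, 2) touches both disks. With k = 1 and D = 2 no single round can use both disks, which is the k < D case where contexts should be striped instead.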

When each virtual processor context is distributed across disks, all disk I/O of
sufficient size is fully parallel, so the scheduler behaviour need not be defined and no
restriction is required of k and D. In this case a separate restriction is necessary:
individual reads and writes must be large enough that they will be performed across
all D disks. With a straightforward round-robin block distribution strategy as used
by striped RAID systems and the STXXL block layer, this requires ω ≥ BD for fully
parallel message delivery, and μ ≥ BD for fully parallel swapping. For any reasonable
configuration, μ ≥ BD. ω may be < BD, but if this is the case messages are so small
that full disk parallelism is impossible (since disks cannot perform transfers smaller
than B), so we will simply assume ω ≥ BD to simplify analysis.

Def. 6.5.1 summarises these conditions.

Definition 6.5.1 (Fully Parallel Swapping). If each virtual processor context resides
on a single disk, k ≥ D, and virtual processors are scheduled in increasing order by
ID, then PEMS2 performs all swapping I/O across all D disks in parallel.

If each virtual processor context is distributed evenly across all disks in a blockwise
fashion, and μ ≥ BD, then PEMS2 performs all swapping I/O across all D disks in
parallel.
Unfortunately, this behaviour conflicts with the potential swapping optimisation
described in §2.3.1 where k swaps can be avoided at each virtual superstep barrier.
Accordingly, the run times given in this thesis do not include that optimisation.

6.6 Allocation 

When PEMS is initialised it first allocates all memory required by virtual processors. 
It then intercepts allocation requests from the simulated algorithm and satisfies them 
by allocating the requested memory from this pool. 

To fully support dynamic memory allocation and deallocation, PEMS2 uses a
more sophisticated allocation scheme than PEMS1 (see §2.3.4). All virtual processor
memory is still contained within a single region of size μ. Unlike PEMS1, however,
PEMS2 stores the offset and size of each allocation. This enables freeing of allocated
memory, which can then be reused by future allocations.

The allocation records are stored using a simple balanced binary search tree in
memory. Since the number of allocations is relatively small and the overhead of this
data structure is not significant compared to disk I/O, a more sophisticated structure
would not likely show any significant improvement.

The allocation algorithm is simple: search from the lowest address until a large 
enough free chunk is found, then split the start of this chunk into a newly allocated 
area of appropriate size. 

Deallocation is also straightforward: remove the allocated chunk, and merge with 
any adjacent free chunks. If there are no adjacent free chunks, simply record the area 
as deallocated. 
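
The allocation and deallocation steps described above can be sketched as a first-fit allocator. This is an illustration only: it keeps the records in a sorted Python list rather than the balanced binary search tree the text mentions, and the class and method names are hypothetical.

```python
class Allocator:
    """First-fit allocator over a single region of a given size."""

    def __init__(self, size):
        # Each chunk is [offset, size, is_free]; the list stays sorted by offset.
        self.chunks = [[0, size, True]]

    def alloc(self, size):
        """Return the offset of a new allocation, or None if nothing fits."""
        for idx, (off, csize, free) in enumerate(self.chunks):
            if free and csize >= size:
                # Split the start of the free chunk into the new allocation.
                self.chunks[idx] = [off, size, False]
                if csize > size:
                    self.chunks.insert(idx + 1, [off + size, csize - size, True])
                return off
        return None

    def free(self, offset):
        """Release the allocation at offset and merge adjacent free chunks."""
        for idx, chunk in enumerate(self.chunks):
            if chunk[0] == offset and not chunk[2]:
                chunk[2] = True
                if idx + 1 < len(self.chunks) and self.chunks[idx + 1][2]:
                    chunk[1] += self.chunks.pop(idx + 1)[1]
                if idx > 0 and self.chunks[idx - 1][2]:
                    self.chunks[idx - 1][1] += self.chunks.pop(idx)[1]
                return
        raise ValueError("offset is not the start of an allocated chunk")

    def allocated_regions(self):
        """(offset, size) pairs that would need to be swapped."""
        return [(off, size) for off, size, free in self.chunks if not free]
```

`allocated_regions` is the information a swap routine would need in order to skip unallocated memory.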

More sophisticated strategies are of course possible; efficient allocation with min- 
imal fragmentation is a much researched problem. In the context of PEMS, however, 
the most important benefit of an allocator over the basic design of PEMSl is the 
ability to re-use deallocated memory, and avoid I/O for currently unallocated mem- 
ory regions. The relatively simple allocator presented here, though not optimal with 
respect to fragmentation, does provide these two advantages. 
Figure 6.4: Memory Allocation in PEMS2
Figure 6.5: Memory Deallocation in PEMS2

The swapping related I/O functions in PEMS2 have been modified to only swap
currently allocated regions of memory, rather than swap the entire partition in a
single read/write operation as in PEMS1. As a result, programs which free memory
as soon as possible see improved performance due to less I/O. For programs with very
dynamic allocation behaviour, this can amount to a significant reduction in I/O and
total run time compared to the PEMS1 strategy.



Chapter 7 



New and Improved Communication Algorithms 

7.1 Alltoallv 

In an Alltoallv, every virtual processor sends a message of arbitrary size to every
other virtual processor, thus v² messages are exchanged in total. Alltoallv is
the most powerful collective communication operation implemented in PEMS that
performs only communication (i.e. new values are not computed as part of the
operation).

Due to the complexity and size of the EM-Alltoallv algorithm, the single-processor
and multi-processor versions are presented here separately. These are referred
to as EM-Alltoallv-Seq and EM-Alltoallv-Par, respectively. The algorithm
in general (i.e. for both single processor and multi-processor cases) is referred to
as EM-Alltoallv. Note the implementation makes no such distinction and simply
provides an implementation of the MPI_Alltoallv function that works in both cases.




Figure 7.1: Alltoallv Operation 




7.1.1 Single Processor
Algorithm

Alg. 7.1.1 describes the single processor implementation of Alltoallv in PEMS2.
Note that all I/O (including swapping) is explicitly performed; superstep barriers
do not imply swapping. This algorithm and the others in this section perform fine-grained
swapping, e.g. "Swap message in" means the message (which resides in the
virtual processor's context) should be swapped in from disk to its usual location in
the memory partition, just as if the entire partition was swapped in.

The reader is encouraged to review the simpler implementations of Alltoallv
(Alg. 2.2.1 and Alg. 6.2.1), since the algorithm given here solves the same problem in
a similar but more intricate way.
Algorithm 7.1.1: EM-Alltoallv-Seq

Data: S : Array of pointers to v messages to send
Data: R : Array of pointers to v messages to receive

1 Let T be a shared v × v table of incoming message offsets
2 Let E be a shared array of v execution states, all initially false
3 Let M be a cache of at most 2v² border blocks (2v per virtual processor)

-- Store message offsets and synchronise --

4 Swap out everything except regions in R
5 Store incoming message offsets from R in T    -- T_p is valid
6 Set E_p to true    -- This thread has reached this point
7 Synchronise with the k − 1 other currently running threads

-- Deliver messages if possible --

8 foreach message m_{p→i} in S do
9     Update M with the start and end of this message
10    if E_i is true then    -- Thread i has recorded its offsets in T
11        Align and deliver directly to T_{p→i} on disk

-- Finished Internal Superstep 1 --

-- Begin Internal Superstep 2 --

-- Deliver remaining messages --

12 foreach message m_{p→i} in S not delivered in superstep 1 do
13     Swap message in
14     Align and deliver directly to T_{p→i} on disk

-- Finished Internal Superstep 2 --

-- Begin Internal Superstep 3 --

-- (Blocked I/O only) Flush border block cache --

15 Flush border blocks in M to our context

-- Finished Virtual Superstep --
Analysis

Similar disk parallelism issues arise in the analysis of EM-Alltoallv as those described
in §6.5, but with respect to message delivery rather than swapping. Since
message delivery, like swapping, happens in the same order as virtual processor execution,
the same arguments used for swapping (Def. 6.5.1) apply to message delivery
as well. Def. 7.1.1 summarises the necessary conditions for communication methods
that, like EM-Alltoallv, perform message I/O to/from all virtual processors.

Definition 7.1.1 (Fully Parallel Message Delivery). If each virtual processor context
resides on a single disk, k ≥ D, and virtual processors are scheduled in increasing
order by ID, then a communication function which performs message I/O to/from all
virtual processors does so across all D disks in parallel.

If each virtual processor context is distributed evenly across all disks in a blockwise
fashion, and ω ≥ BD, then a communication function which performs message I/O
to/from all virtual processors does so across all D disks in parallel.

Definition 7.1.2 (Fully Parallel I/O). A communication function has "fully parallel
I/O" if it has both fully parallel swapping (Def. 6.5.1) and fully parallel message
delivery (Def. 7.1.1).



Lemma 7.1.3. When used with explicit I/O, EM-Alltoallv-Seq performs
vμ + ((v² − vk)/2)ω + 2v²B I/O.

Proof. The fundamental difference between Alg. 7.1.1 and Alg. 2.2.1 is that the
amount of I/O performed by a given virtual processor depends on how many virtual
processors have finished executing previously.
Let δ be the number of messages delivered directly on line 11.
Let λ be the number of messages delivered indirectly on line 14.
In lines 8...10, threads deliver directly to all threads that have completed Internal
Superstep 1. Since threads execute in synchronised rounds k at a time, the first round
of k threads each deliver k messages directly, the next round 2k (since 2k threads have
now run), the next round 3k, etc. Hence:

\[
\delta = k \sum_{i=1}^{v/k} ik
       = \frac{k^2}{2}\left(\frac{v^2}{k^2} + \frac{v}{k}\right)
       = \frac{v^2 + vk}{2}
\]

\[
\lambda = v^2 - \delta
\]

The remaining analysis is straightforward:

\[
\begin{aligned}
I_{seq} &= I_4 + I_{11} + I_{13..14} + I_{15} \\
        &= (v\mu - v^2\omega) + (\delta\omega) + (2\lambda\omega) + (2v^2B) \\
        &= v\mu - v^2\omega + \delta\omega + 2v^2\omega - 2\delta\omega + 2v^2B \\
        &= v\mu + v^2\omega - \frac{v^2 + vk}{2}\,\omega + 2v^2B \\
        &= v\mu + \frac{v^2 - vk}{2}\,\omega + 2v^2B
\end{aligned}
\]

□
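
The closed forms in this proof can be checked numerically by counting deliveries round by round. The sketch assumes, as the analysis does, that v is a multiple of k and that every message has size exactly ω; the function names are hypothetical.

```python
def count_direct_deliveries(v, k):
    """Count messages delivered directly in internal superstep 1.

    Threads run in rounds of k; a thread can deliver directly to every
    thread that has already recorded its offsets, including those in
    its own round.
    """
    assert v % k == 0
    recorded = 0   # threads that have stored their offsets so far
    delta = 0
    for _ in range(v // k):
        recorded += k            # this round's threads record and synchronise
        delta += k * recorded    # each of the k threads delivers to all recorded threads
    return delta


def seq_io(v, k, mu, omega, B):
    """Total I/O as the sum of the four terms used in the proof."""
    delta = count_direct_deliveries(v, k)
    lam = v * v - delta
    swap_out = v * mu - v * v * omega   # swap out everything except receive regions
    direct = delta * omega              # direct deliveries
    indirect = 2 * lam * omega          # swap message in, then deliver
    flush = 2 * v * v * B               # flush the border block cache
    return swap_out + direct + indirect + flush
```

For every tested pair (v, k), the counted δ equals (v² + vk)/2 and the summed I/O equals vμ + ((v² − vk)/2)ω + 2v²B.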



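Corollary 7.1.4 (Improvement). When used with explicit I/O, EM-Alltoallv-Seq
performs 2vμ + ((3v² + vk)/2)ω − 2v²B less message delivery I/O per virtual superstep
than Alg. 2.2.1 (PEMS1-Alltoallv-Seq).

Proof. With I_{orig-seq} denoting the I/O performed by Alg. 2.2.1:

\[
I_{orig\text{-}seq} - I_{seq}
  = I_{orig\text{-}seq} - \left(v\mu + \frac{v^2 - vk}{2}\,\omega + 2v^2B\right)
  = 2v\mu + \frac{3v^2 + vk}{2}\,\omega - 2v^2B
\]

□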

Lemma 7.1.5. EM-Alltoallv-Seq uses at most 2v²B shared buffer space.

Proof. The only buffer space used is for the block cache, when direct I/O is in use.
Each of the v local virtual processors has 2 blocks in the cache for each of its v
received messages. □



Theorem 7.1.6. Given fully parallel I/O (Def. 7.1.2), EM-Alltoallv-Seq takes 



+ + + L time. 



Proof. Follows directly from Lem. 7.1.3 since EM-Alltoallv-Seq performs no network
communication and no significant computation. □

Benchmarks



Fig. 7.2 shows the run time of a single call to EM-Alltoallv-Seq for various numbers
of 32-bit integers. The x-axis represents total problem size as a number of 32-bit
integers, and the y-axis represents total run time. Times are shown for both memory-mapped
("mmap") and explicit ("unix") I/O, for k = 1 and k = 4 cores (e.g. alltoall-mmap-k1
represents memory-mapped I/O with 1 core).

No action is performed by the program other than a single Alltoallv on the
complete data set. Note in particular the performance improvement seen with UNIX
I/O when using 4 cores compared to using a single core. Since the test program
performs no significant computation, this shows that the run time of EM-Alltoallv-Seq
itself improves when k increases, as Thm. 7.1.6 predicts (since the vk term in
the message delivery time is subtracted).

The situation is reversed with memory mapped I/O, due to the overhead of the 
operating system's cache mechanism. This is not surprising: since communication 
primitives in PEMS2 are carefully tuned to minimise I/O, explicit I/O will always 
result in better performance for a trivial program that simply calls a collective com- 
munication function once. The potential benefit of memory-mapped I/O is that the 
operating system's cache mechanism can avoid a large amount of I/O for certain 
programs, but that is not the case here. 



Note that the experiment shown in Fig. 7.2 is a trivial program that does not
represent a case where multi-core or memory-mapping are expected to show much
benefit. Fig. 7.2 is not intended to illustrate the improved performance of PEMS2,
only that an improvement is seen when using multiple cores in spite of the fact that
no computation is performed. Experiments to illustrate the improvements in PEMS2
for realistic use cases are shown in Chapter 8.

Figure 7.2: Single Processor EM-Alltoallv Performance

7.1.2 Multiple Processor
Algorithm



Alg. 7.1.2 describes the multiple processor implementation of Alltoallv in PEMS2.
Local message delivery occurs in an identical manner as in the single processor case. 
Remote messages are handled in the second internal superstep, when all message 
destinations are known (since they have been recorded in the shared table in the 
previous internal superstep). Using this information, virtual processors receive on 
behalf of their local peers and deliver directly to their contexts on disk. 



Analysis 

Lemma 7.1.7. EM-Alltoallv-Par takes + communication time, assuming
αkω ≥ b, with g, b, and l as in the BSP model (see Appendix B).

Proof. Each virtual processor must send one message to each other virtual processor, 
and all messages are sent directly to the real processor that hosts the destination 
virtual processor. 

All communication performed by EM-Alltoallv-Par is performed by
EM-Alltoallv-Par-Comm. This algorithm sends messages from the "round" of Pk
currently executing virtual processors in "chunks" of size α (where α is a user-defined
parameter indicating the number of messages to send at once, 1 ≤ α ≤ v).

In each round, k virtual processors are active on each real processor, and each
of these virtual processors sends v messages over the network. Thus, each round
consists of separate αkω-relations.

2 

such rounds occur, therefore there are such relations in total. □ 

Lemma 7.1.8. When used with explicit I/O, EM-Alltoallv-Par performs ^ -|- 
- + 1^ - 1^) a; + 2v^B I/O. 

Proof. The local message delivery of EM-Alltoallv-Par is
identical to that of EM-Alltoallv-Seq, therefore this portion of the analysis (I_4,
I_11, I_13..14, and I_15) is identical except with v/P local virtual processors rather than v.
The remaining I/O is I_17..18 for the received network messages. □

For simplicity of analysis, messages to local virtual processors are included in this figure though
in reality this is optimised away and each virtual processor sends v − v/P messages over the network.



Lemma 7.1.9. EM-Alltoallv-Par uses at most 2v²B/P + αkω shared buffer space.

Proof. The size of the block cache is equivalent to the sequential case (Lem. 7.1.5).
Additional space is used by EM-Alltoallv-Par-Comm to assemble messages contiguously
for communication. The "chunk size", α, is a user parameter which controls
the amount of buffer space used for this purpose: at most αkω buffer space is used. □



Theorem 7.1.10. Given fully parallel I/O (Def. 7.1.2), EM-Alltoallv-Par takes 



S 



Vfl 

PDB 



+ G{4 + 



2P 



+ G2v'B + + + L time. 



PDB 



Pka 



Proof. Follows directly from Lem. 7.1.8 and Lem. 7.1.3. □
Algorithm 7.1.2: EM-Alltoallv-Par

Data: S : Array of pointers to v messages to send
Data: R : Array of pointers to v messages to receive

1 Let T be a shared v × v/P table of incoming message offsets
2 Let E be a shared array of execution states, all initially false
3 Let M be a cache of at most 2v²/P border blocks (2v per local thread)

-- Store message offsets and synchronise --

4 Swap out everything except regions in R
5 Store incoming message offsets from R in T    -- T_p is now valid
6 Set E_p to true    -- This thread has reached this point
7 Synchronise with the k − 1 other currently executing local threads

-- Deliver messages if possible --

8 foreach local message m_{p→i} in S do
9     Update M with the start and end of this message
10    if E_i is true then    -- Thread i has recorded its offsets in T
11        Align and deliver directly to T_{p→i} on disk

-- Finished Internal Superstep 1 --

-- Begin Internal Superstep 2 --

-- Deliver remaining messages --

12 foreach local message m_{p→i} in S not delivered in superstep 1 do
13     Swap message in
14     Align and deliver directly to T_{p→i} on disk
15 Communicate using Alg. 7.1.3
16 foreach remote message m_{i→j} received do
17     Update M with the start and end of this message
18     Align and deliver directly to T_{i→j} on disk

-- Finished Internal Superstep 2 --

-- Begin Internal Superstep 3 --

-- (Blocked I/O only) Flush border blocks --

19 Flush border blocks in M to our context

-- Finished Virtual Superstep --
Algorithm 7.1.3: EM-Alltoallv-Par-Comm

Data: S : Array of pointers to v messages to send
Data: R : Array of pointers to v messages to receive
Data: T : Shared v × v/P table of message offsets

1  Let α be the network "chunk size" parameter
2  Let B be the shared communication buffer
3  foreach i in 0, α, 2α, 3α, ..., v/P − 1 do
4      Assemble messages to threads i ... i + α − 1 on each real processor
       contiguously in B
5      if this is the last of k threads to reach this point then
6          Send/Receive assembled messages with MPI_Alltoallv
7      Deliver received messages using T
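The chunked loop of Alg. 7.1.3 can be sketched as follows. This is a minimal illustration of the index arithmetic only, with no communication: the function names are hypothetical, and clipping the final chunk when α does not divide v/P is an assumption rather than something stated in the algorithm.

```python
from math import ceil


def chunk_starts(local_threads, alpha):
    """Start indices 0, alpha, 2*alpha, ... below local_threads."""
    return list(range(0, local_threads, alpha))


def chunk_members(start, alpha, local_threads):
    """Destination thread indices start ... start + alpha - 1 (clipped)."""
    return list(range(start, min(start + alpha, local_threads)))


# Hypothetical parameters: v = 16 virtual processors on P = 2 real
# processors, with a chunk size of 3 threads per communication round.
v, P, alpha = 16, 2, 3
local = v // P
starts = chunk_starts(local, alpha)
last_chunk = chunk_members(starts[-1], alpha, local)
```

Each start index corresponds to one MPI_Alltoallv round, so the number of rounds per real processor is the ceiling of (v/P)/α.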
7.2 Bcast

In a Bcast (broadcast), a single root virtual processor sends a single message to
every other virtual processor, i.e. every virtual processor receives the same individual
message.









Figure 7.3: Bcast Operation



The PEMS2 implementation of Bcast uses rooted synchronisation on the real
processor that contains the root, and initial synchronisation on all other real
processors (see §4.3).

All threads on the same real processor as the root wait for the root to write the 
message to the shared buffer. They then copy the message from the shared buffer to 
their receive buffer. 

The message is delivered remotely using a single MPI_Bcast: the root sends, and 
the first virtual processor to run on other real processors receives into the shared 
buffer. This receiving virtual processor then signals, and other virtual processors 
copy from the shared buffer to their receive buffers. 
7.2.1 Algorithm

Algorithm 7.2.1: EM-Bcast

Data: S : Send buffer of size ω (valid only at root)
Data: R : Receive buffer of size ω

Let B be the initial portion of shared buffer of size ω

— Broadcast —
if this is the root then
    Copy S to B
    EM-Signal-Threads() — Signal that data is ready
    if P > 1 then
        MPI_Bcast from S — Send to other real processors
else — This is not the root
    if the root is on this real processor then
        EM-Wait-For-Root()
        Copy from shared buffer to R
    else — Root is on another real processor
        if P > 1 and EM-First-Thread() then
            MPI_Bcast to B (receive from root)
            EM-Signal-Threads()
        EM-Wait-Threads()
        Copy from B to R

— Finished Virtual Superstep —
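The local half of this algorithm (root writes to the shared buffer, signals, and the other threads copy out) can be sketched with ordinary threads. This is an illustrative model only: it uses Python's `threading.Event` in place of the PEMS signalling primitives, omits disk I/O and MPI entirely, and the function name is hypothetical.

```python
import threading


def simulate_local_bcast(message, num_threads, root):
    """Model a broadcast among threads that share one buffer."""
    shared = bytearray(len(message))   # shared buffer B
    ready = threading.Event()          # stands in for the "data is ready" signal
    received = [None] * num_threads    # per-thread receive buffers R

    def worker(tid):
        if tid == root:
            shared[:] = message        # Copy S to B
            ready.set()                # wake the waiting threads
            received[tid] = bytes(message)
        else:
            ready.wait()               # wait for the root
            received[tid] = bytes(shared)  # Copy from B to R

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received


result = simulate_local_bcast(b"splitters", num_threads=4, root=2)
```

Only the root ever writes to the shared buffer, so the waiting threads need no further locking once the event is set.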
7.2.2 Analysis

Lemma 7.2.1. EM-Bcast takes S-^^ ~^^yBB ^^"^^ perform I/O (not including 
virtual superstep overhead). 



Proof. The cost of EM-Wait-For-Root is given by Lem. 4.3.1, and each virtual
processor delivers the buffer (of size ω) to its context. □

Lemma 7.2.2. EM-Bcast performs a single network ω-relation, where ω is the size
of the buffer to broadcast.

Proof. MPI_Bcast is called exactly once with a buffer of size ω. □



Theorem 7.2.3. EM-Bcast takes Sfj^ + Gy^ + gf + l + L time where u is the 
size of the buffer to broadcast, assuming vuj > B and u > h. 



Proof. Follows directly from Lemma 7.2.2 and Lemma 7.2.1 , since extra swapping 
only occurs for a single virtual processor and message delivery occurs for all virtual 
processors in parallel. □ 






7.3 Gather 

In a Gather, each virtual processor sends a message to the root virtual processor. 





















Figure 7.4: Gather Operation



The PEMS2 implementation of Gather uses final synchronisation (see §4.3). In
both the single and multiple processor cases, the gathered messages are assembled in 
the shared buffer before finally being collected by the root virtual processor. 

In the single processor case, the virtual processors simply copy to the appropriate 
location in the shared buffer, then signal. When all threads have signalled, the root 
copies the result from the shared buffer to its receive buffer and the operation is 
complete. 

In the multiple processor case, each virtual processor participates in an MPI_Gather 
to send data to the real processor which hosts the root. When these communication 
rounds are completed, all gathered data resides in the shared buffer at the real pro- 
cessor which hosts the root. The root virtual processor then copies this data to its 
receive buffer and the operation is complete. 



7.3.1 Algorithm 
Algorithm 7.3.1: EM-Gather

Data: S : Send buffer of size ω
Data: R : Receive buffer of size vω (valid only at root)

 1  Let B be the initial portion of shared buffer of size vω
 2  Let y = false (yielded, swapped out)
 3  if P > 1 then
 4      if the root is on this real processor then
 5          MPI_Gather P ranks of current senders (receive)
 6          MPI_Gather(S, B) (receive)
 7          if this is the root then
 8              if not EM-All-Threads-Finished(y) then
 9                  EM-Wait-Threads(y)
10              Copy data from B to R
11          else
12              EM-Thread-Finished()
13      else — Root is not on this real processor
14          MPI_Gather P ranks of current senders (send)
15          MPI_Gather(S) (send)
16  else — Single processor
17      if this is the root then
18          if not EM-All-Threads-Finished(y) then
19              EM-Wait-Threads(y)
20          if y then
21              Swap in R
22          Copy S to B
23          Copy B to R
24      else — This is not the root
25          Copy S to B
26          EM-Thread-Finished()

    — Finished Virtual Superstep —
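The single processor case can be modelled with plain threads: every thread copies its message into its own slot of a shared buffer, and the root waits until all have finished before collecting the result. This is a simplified sketch, not the PEMS2 implementation: the finished-thread counter and `threading.Event` stand in for EM-Thread-Finished and EM-Wait-Threads, swapping is ignored, and the function name is hypothetical.

```python
import threading


def simulate_local_gather(messages, root):
    """Model a gather among threads that share one buffer."""
    v = len(messages)
    omega = len(messages[0])
    assert all(len(m) == omega for m in messages), "equal-size messages assumed"

    shared = bytearray(v * omega)      # shared buffer B of size v * omega
    lock = threading.Lock()
    all_done = threading.Event()
    finished = 0
    result = {}

    def worker(tid):
        nonlocal finished
        # Copy S to this thread's slot in B (slots are disjoint).
        shared[tid * omega:(tid + 1) * omega] = messages[tid]
        with lock:
            finished += 1
            if finished == v:
                all_done.set()
        if tid == root:
            all_done.wait()            # wait for every thread to finish
            result["R"] = bytes(shared)  # Copy B to R

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(v)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result["R"]


msgs = [bytes([i]) * 3 for i in range(5)]
gathered = simulate_local_gather(msgs, root=1)
```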
7.3.2 Analysis

Lemma 7.3.1. EM-Gather takes at most S^^^ time to perform I/O (not including 
virtual superstep overhead), assuming ω > B.

Proof. In the multi-processor case, the root may swap its context out via
EM-Wait-Threads (μ I/O, since only the root calls this function). In this case, R is not
swapped in at line 10, so this copy is actually a disk write (ω I/O), for a total of
μ + ω I/O in the worst case. Other virtual processors perform no additional I/O. The
single processor case clearly performs less I/O in the worst case. □



Lemma 7.3.2. EM-Gather takes $g\frac{v\omega}{Pb} + l\frac{v}{P}$ communication time, assuming ω > b.

Proof. EM-Gather performs an MPI_Gather for P threads at once, where each
real processor sends one message of size ω to the root. Thus, EM-Gather performs
v/P network ω-relations. The lemma follows directly from the definitions of l, g, and b
in the BSP model. □



Theorem 7.3.3. EM-Gather takes + g^ + + L time, assuming u > b 



and Lo > B. 



Proof. Follows directly from Lem. 7.3.1 and Lem. 7.3.2. □
7.4 Reduce

A Reduce operation applies an associative and commutative¹ operator to v values
(one from each virtual processor), placing the single result in a buffer on some root
virtual processor. This operation is vectorized across n values, i.e. a single Reduce
performs n reductions of size v resulting in n values at the root. (This definition
corresponds to MPI_Reduce, which is not implemented by PEMS1.)

A Reduce can be performed with less I/O and communication than an Alltoall,
since several values may be reduced to a single value before delivery. On each real
processor, for each of the n reductions, EM-Reduce performs k operations at a time
in parallel into the shared buffer, which must therefore be large enough to hold kn values.
When all local threads have finished, the final thread reduces these k values to a single
value, and communicates that value to the root. This requires 2 internal supersteps,
but only a single swap and a single network communication (if necessary).

Figure 7.5: EM-Reduce (P = 2, v = 8, k = 2, and n = 4)

¹MPI allows the user to define non-commutative operators, but PEMS currently requires
operators to be commutative.
7.4.1 Algorithm

Algorithm 7.4.1: EM-Reduce

Data: S : Array of n values to send
Data: R (root only) : Array of n values for result
Data: r (root) : ID of root virtual processor

 1  Let B be one of k shared buffer portions of size n

    — Partially reduce local data —
 2  Swap out
 3  Reduce S into B

    — Finished Internal Superstep 1 —
    — Begin Internal Superstep 2 —

    — Merge partial reductions —
 4  if p = r or p is a thread on a different real processor than r then
 5      Reduce kn values in shared buffer to n local results
 6      if P > 1 then
 7          MPI_Reduce n local results to r's real processor
 8  if p = r then
 9      if P > 1 then
10          MPI_Reduce Pn values from the network into R
11      Swap R out to partition on disk

    — Finished Virtual Superstep —



7.4.2 Analysis 

Lemma 7.4.1. EM-Reduce takes $\frac{nv}{Pk} + nk$ computation time to reduce each real
processor's local values to a local result.

Proof. All operations are performed on vectors of size n. For each of these n
elements: each real processor first reduces its v/P values on k cores in parallel, resulting in
k intermediate values (Step 1 in Fig. 7.5), which takes $\frac{v}{Pk}$ time. Then, these k
intermediate local results are combined by application of the reduction operator (Step 2
in Fig. 7.5), which takes k time. Thus, it takes $\frac{v}{Pk} + k$ time to reduce a vector of size
1, or $\frac{nv}{Pk} + nk$ time to reduce a vector of size n. □
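The two local steps in this proof can be sketched sequentially. The code below is illustrative only: it runs the k partial reductions one after another rather than on k cores, the function names are hypothetical, and the interleaved grouping relies on the operator being associative and commutative (as PEMS requires).

```python
from functools import reduce
import operator


def two_stage_reduce(values, k, op):
    """Reduce values via k partial results (Step 1), then merge them (Step 2)."""
    # Step 1: split the local values into k groups, one per core, and
    # reduce each group to a single intermediate value.
    groups = [values[i::k] for i in range(k)]
    partials = [reduce(op, group) for group in groups if group]
    # Step 2: combine the (at most k) intermediate values.
    return reduce(op, partials)


def reduce_vectors(vectors, k, op):
    """Vectorised form: one two-stage reduction per element position."""
    n = len(vectors[0])
    return [two_stage_reduce([vec[j] for vec in vectors], k, op)
            for j in range(n)]


# Hypothetical data: 4 local virtual processors, each sending n = 4 values.
local_vectors = [[i, 2 * i, i * i, 1] for i in range(1, 5)]
sums = reduce_vectors(local_vectors, k=2, op=operator.add)
maxima = reduce_vectors(local_vectors, k=2, op=max)
```

The result is independent of k; only the amount of work that could proceed in parallel changes.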

Lemma 7.4.2. EM-Reduce takes time to perform I/O (not including virtual 
superstep overhead). 
Proof. The root processor delivers the final result of size nω to its context on disk.
Precisely one swap occurs per virtual processor, which is accounted for by L if
necessary. □

When P > 1, EM-Reduce performs a single network MPI_Reduce operation. The
precise communication and computation time may vary between MPI
implementations, but we can find reasonable bounds by assuming the MPI implementation is at
least as good as the "obvious" algorithm, as described in Lem. 7.4.3.



Lemma 7.4.3. A reasonable MPI_Reduce implementation on a switched network
reduces nP values across P processors (Step 3 in Fig. 7.5) in
$n\lg(P) + g\frac{n\omega\lg(P)}{b} + l\lg(P)$ time, assuming $n\omega\lg(P) > b$.

Proof. MPI_Reduce can be implemented as a parallel tree reduction to achieve
logarithmic time, as shown in Fig. 7.6. The result is computed as lg(P) parallel partial
reductions. Each partial reduction combines two vectors of length n (n lg(P) total
time), and is the result of sending a single vector of n values each of size ω over the
network ($g\lg(P)\frac{n\omega}{b}$ total time). □
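The tree reduction described in this proof can be simulated sequentially to check the round count. This is a sketch of the "obvious" algorithm only, not of any particular MPI implementation; the function name is hypothetical, and each loop iteration stands for one round in which all pairs would combine in parallel.

```python
from math import ceil, log2
import operator


def tree_reduce(vectors, op):
    """Pairwise tree reduction of equal-length vectors.

    Returns (result_vector, rounds), where rounds is the number of
    parallel combining steps that were needed.
    """
    current = [list(vec) for vec in vectors]
    rounds = 0
    while len(current) > 1:
        nxt = []
        # Combine neighbours pairwise; each pair models one network send
        # of n values followed by n applications of the operator.
        for i in range(0, len(current) - 1, 2):
            nxt.append([op(a, b) for a, b in zip(current[i], current[i + 1])])
        if len(current) % 2 == 1:
            nxt.append(current[-1])    # odd one out waits for the next round
        current = nxt
        rounds += 1
    return current[0], rounds


# Hypothetical input: P = 4 processors, each holding n = 2 local results.
result4, rounds4 = tree_reduce([[p, 10 * p] for p in range(4)], operator.add)
result5, rounds5 = tree_reduce([[p] for p in range(5)], operator.add)
```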



Figure 7.6: Logarithmic MPI_Reduce



Theorem 7.4.4. EM-Reduce takes G^+g 
time. 



+ 1 lg(P) + n lg(P) + fl + nA; + L 



Proof. Follows directly from Lem. 7.4.1, Lem. 7.4.2, and Lem. 7.4.3 



□ 
7.5 Summary

Operation        Buffer Space
Bcast            ω
Gather           vω
Reduce           kn
Alltoallv-Seq    block cache (Lem. 7.1.5)
Alltoallv-Par    block cache (Lem. 7.1.5) + αkω

Figure 7.7: Communication Algorithm Buffer Space



Operation Time 

Beast S^ + G^ + g'i^ + l + L 

Gather 5^ +g'% + lj, + L 

Reduce + g"^^^^ + / lg(P) + n lg(P) + || + nA; + L 
Alltoallv-Seq + G^^o; + + L 



Alltoallv-Par + + It - H - pSb + ^2t;25 + + + L 

Figure 7.8: Communication Algorithm Run Time 



Chapter 8 
Experiments 



8.1 Experimental Setup 



All experiments in this chapter were performed on the HPCVL cluster described in
detail in Appendix C.1.



8.2 Plot Style 

Labels for plot lines show the program name, PEMS version, I/O style, and number of
processors. For example, "PSRS PEMS2 (mmap) P=2" refers to the PSRS algorithm
running on PEMS2 with mmap I/O on 2 real processors. The three types of I/O style
referred to are shown in Fig. 8.2.

Label         I/O Style
unix          Synchronous UNIX I/O
stxxl-file    Asynchronous STXXL File I/O
mmap          Memory Mapped I/O

Figure 8.1: PEMS2 I/O Styles

The label "stxxl" refers to STXXL's included sorting algorithm, run on a
single processor (the program does not support distributed processors). This data is
included on all plots to provide a consistent baseline for comparison of other results.

Variables shown below the plot (e.g. μ or n/v) apply to all runs shown in that plot,
with the exception of the "stxxl" data.
8.3 Sorting

The well-known Parallel Sorting by Regular Sampling [17] (PSRS) algorithm is a good
candidate for use with PEMS and explicit I/O due to its coarse granularity and small,
constant number of supersteps. Alg. 8.3.1 shows a simple high-level description of
this algorithm, with calls to collective communication functions (i.e. calls that result
in I/O via PEMS) shown in bold.



8.3.1 Algorithm 



Algorithm 8.3.1: PSRS

Data: V (data) : Array of size n/v

 1  Sort V
 2  Choose v equally spaced splitters in V
 3  Gather all v² splitters at root
 4  Sort all v² splitters at the root
 5  Bcast splitters evenly to all processors (each receives v splitters)
 6  Locate splitters in (sorted) V
 7  Compute the number of elements in V in each bucket
 8  Alltoall bucket sizes (each sends/receives v sizes)
 9  Alltoallv buckets to final destination processor
10  Merge received buckets
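The steps of Alg. 8.3.1 can be followed in a small sequential simulation, where each "virtual processor" is simply a list and the collective operations become ordinary list manipulation. This is a sketch under assumptions, not the thesis implementation: the function name is hypothetical, and it uses the common PSRS variant that selects v − 1 global splitters from the gathered samples.

```python
import bisect
import heapq
import random


def psrs(data, v):
    """Sequentially simulate PSRS over v virtual processors."""
    n = len(data)
    size = (n + v - 1) // v                      # about n/v elements each
    # Step 1: each virtual processor sorts its local data.
    local = [sorted(data[i * size:(i + 1) * size]) for i in range(v)]
    # Step 2: each chooses v equally spaced samples from its local data.
    samples = []
    for arr in local:
        if arr:
            samples.extend(arr[(j * len(arr)) // v] for j in range(v))
    if not samples:
        return []
    # Steps 3-4: gather all samples at the root and sort them.
    samples.sort()
    # Step 5: broadcast the chosen splitters (v - 1 of them in this variant).
    splitters = [samples[(j * len(samples)) // v] for j in range(1, v)]
    # Steps 6-7: locate the splitters in each sorted local array.
    buckets = []
    for arr in local:
        cuts = [0] + [bisect.bisect_right(arr, s) for s in splitters] + [len(arr)]
        buckets.append([arr[cuts[j]:cuts[j + 1]] for j in range(v)])
    # Steps 8-9: all-to-all exchange, processor j receives bucket j from all.
    # Step 10: each processor merges the sorted buckets it received.
    result = []
    for j in range(v):
        received = [buckets[i][j] for i in range(v)]
        result.extend(heapq.merge(*received))
    return result


random.seed(0)
example = [random.randrange(1000) for _ in range(500)]
sorted_example = psrs(example, v=8)
```

Only the final exchange moves the bulk of the data; the earlier steps handle samples and splitters, which is why PSRS has so few, coarse supersteps.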



8.3.2 Analysis 

Let 71 be the size of an integer used for counts. 
Let e be the size of an individual data element. 



Alg. 8.3.1 consists of four supersteps, the first three of which communicate only 
counts of a fixed size. 

The remaining call, Alltoallv, does all the work of distributing data among
processors. The message sizes in this step therefore depend on how balanced the global
partitioning is. The partitioning scheme used in PSRS guarantees the balancing is
within a factor of 2 of an ideal partitioning [17]. We can assume, then, that the worst
case virtual message size for this final step is ω <
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Operation Size 
8.3.3 Performance

PEMS1 vs. PEMS2 Time (Scaling v)

To allow direct comparison with previous experimental results for PEMS1 [18], these
experiments use a small virtual processor context size, μ = 64 MiB, with an additional
64 MiB for the shared buffer. Because direct I/O is used, the operating system does
not use extra RAM for caching, i.e. performance is as if the system had only this
amount of RAM and is unaffected by additional memory which goes unused¹.

The context size, μ, remains constant for all runs, while the problem size is
increased via v. This is the ideal way to scale PEMS: choose the memory parameters
suitably for the available hardware, then scale v up to reach the desired problem size.

The PEMS1 and PEMS2 programs are identical, and experiments were run on
the same machines with identical configuration and PEMS parameters.

As the figures in this section show, PEMS2 is significantly faster than PEMS1,
particularly with several real processors. When run with 8 processors, PEMS1 is still
not competitive with STXXL, taking over twice as long. PEMS2, however, is faster
than STXXL with 8 processors, and very close in speed with 4. PEMS2 also scales
better than PEMS1, with a slope nearly identical to that of STXXL. In contrast,
the performance gap between PEMS1 and STXXL gets larger as the problem size
increases.

¹This has been verified by monitoring resource consumption and running identical
experiments on machines with reduced RAM.
PEMS1 vs. PEMS2 Speedup

Fig. 8.6 shows the relative speedup of the experiments in the previous section where
n = 4 billion. The speedup shown is relative to the sequential execution of the same
system, i.e. PEMS1 speedup is relative to PEMS1 with P = 1 and PEMS2 speedup
is relative to PEMS2 with P = 1.

This figure illustrates that, as real processors are added, PEMS2 performance
improves significantly more than that of PEMS1.



Figure 8.6: PEMS1 vs. PEMS2 PSRS Relative Speedup (plot lines: PSRS PEMS1 (unix),
PSRS PEMS2 (unix); n = 4096000000, v = 512, n/v = 8000000, μ = 64 MiB)



PEMS1 vs. PEMS2 Time (Scaling μ)

The results in §8.3.3 confirm the hypothesis that always delivering directly to virtual
processor contexts on disk results in improved performance compared to delivering
indirectly via a separate area on disk. However, the parameters used here are not
realistic: modern machines have far more RAM than 128 MiB. While these smaller
runs can be extrapolated to estimate the performance of runs using more RAM (as
mentioned in previous work [13] [18]), when comparing PEMS1 and PEMS2, the
context size has an impact on performance due to the differing disk layouts and delivery
strategies.

As described in §2.3.2, a significant motivating factor for the new direct delivery
strategy was the reduction of disk seeking. Because virtual processor contexts and
the message area reside on separate areas of disk in PEMS1, during a simulation the
disk must seek back and forth between these (possibly very distant) areas in order
to perform swapping and delivery. This effect should increase as μ increases, since
increasing μ increases the distance from a context to the indirect area, as well as the
distance between each region of the indirect area itself.

Fig. 8.7 shows the results of an experiment with the PSRS algorithm that confirms
this observation. In this experiment, the context size (μ) increases, while v remains
constant. Thus there is an equal number of virtual processors for every run, but each
virtual processor handles a larger number of elements. The results clearly show that
PEMS2 scales significantly better than PEMS1 with respect to increasing μ. This is
an important observation, since experiments using small contexts do not illustrate this
aspect of performance. Because modern machines have several GiB of memory, the
experiments in the previous section (and similar experiments in the PEMS1 literature)
do not realistically reflect the performance of PEMS when used for practical purposes,
where it is desirable to use as much RAM as possible to maximise performance.
PEMS2 Large Runs

The dramatically different slopes in Fig. 8.7 suggest that PEMS2 should be more
suited to exploiting the full capabilities of modern machines. In order to investigate
this, the experiments in this section use a more realistic context size, μ = 1 GiB.
Each machine has 4 cores (k = 4), thus 4 GiB RAM per machine is used for virtual
processor contexts. An additional GiB is used for the shared buffer, for a total of 5
GiB².

These runs illustrate performance with much larger problem sizes: up to roughly
32 billion (32 · 10⁹) 32-bit integers, or about 119 GiB of data. Because the PSRS
algorithm requires twice the space in order to sort, as well as additional space for
counts, this amounts to well over 200 GiB of data, significantly larger than the total
amount of physical memory available on the machines used.

The improved performance of PEMS2 with larger context sizes can be seen by
comparing performance to STXXL with the small context results in §8.3.3 (the STXXL
data is identical). In those experiments with small contexts, PEMS2 only surpasses
STXXL performance at P = 8 (Fig. 8.5). In the experiments here with larger
contexts, PEMS2 surpasses STXXL performance at P = 4 (Fig. 8.10), and is very close
when P = 2 (Fig. 8.9), a significant improvement.

Direct comparison of individual runs with similar problem sizes illustrates the
improvement clearly. For example, with P = 8, PEMS2 with small contexts takes
1441 seconds to sort 4 billion elements (Fig. 8.5). PEMS2 with large contexts takes
only 704 seconds to sort 4 billion elements (Fig. 8.11), more than twice as fast as the
comparable run with small contexts. PEMS1 with small contexts takes 3925 seconds
to sort 4 billion elements (Fig. 8.5), making PEMS2 with large contexts more than
5 times as fast. Considering the fact that PEMS2 scales better than PEMS1 with
respect to context size (Fig. 8.7), it is clear that PEMS2 is a significant improvement
over PEMS1 for practical applications which utilize the resources available on modern
hardware.

Performance is best, and most predictable, when using UNIX I/O. Memory-mapped
I/O performs significantly worse, though this is not surprising since PSRS is
not a good choice of algorithm for memory mapping: most memory is used in all steps,
so caching provides little benefit but a large amount of overhead. More surprisingly,
asynchronous STXXL I/O does not outperform the synchronous UNIX I/O with the
exception of a few runs. If the n = 32 billion data point is ignored, the asynchronous
performance for P = 8 looks promising; it may be the case that further optimisation
of the implementation to increase I/O, computation, and communication overlap will
allow asynchronous I/O to show a consistent improvement over synchronous I/O.

²Of course, a small amount of additional memory is used for internal control structures,
thread stacks, the operating system, etc.
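The data volumes and speed ratios quoted for the large runs can be checked with a few lines of arithmetic. The element count and run times are taken from the text above; the variable names are chosen here for illustration, and the factor of two deliberately ignores the additional space for counts.

```python
GIB = 2**30

# 32 billion 32-bit (4-byte) integers.
elements = 32 * 10**9
input_gib = elements * 4 / GIB        # size of the input data in GiB
working_gib = 2 * input_gib           # PSRS needs twice the space to sort

# Run times in seconds for 4 billion elements with P = 8.
pems2_small, pems2_large, pems1_small = 1441, 704, 3925
ratio_small_vs_large = pems2_small / pems2_large   # small vs. large contexts
ratio_pems1_vs_pems2 = pems1_small / pems2_large   # PEMS1 vs. PEMS2 (large)

print(f"input: {input_gib:.1f} GiB, working set: {working_gib:.1f} GiB")
print(f"speed ratios: {ratio_small_vs_large:.2f}x, {ratio_pems1_vs_pems2:.2f}x")
```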
Internal Benchmarks

PEMS includes a more fine-grained benchmarking system which plots the time of a
program's execution at each superstep barrier. This can be used to precisely
investigate the run time of a program and see which sections of code spend the most time.

This data is shown in Fig. 8.12, Fig. 8.14, and Fig. 8.13 for a PSRS run using unix,
mmap, and stxxl-file I/O, respectively. These plots clearly illustrate where time is
consumed among the 4 communication calls in the PSRS algorithm.

Each plot shows a single PSRS execution on a single real processor (runs are using
2 real processors and the same parameters as in the previous section, but each plot
shows only a single real processor). Each line represents the elapsed time of a single
thread. Because the number of threads is relatively high for these runs, there are many
lines on these plots and distinguishing each individually is difficult. However, it is the
overall trend and distribution that is interesting in these plots, not the time taken by
any particular thread.

These plots illustrate the fundamental performance difference between explicit
and memory mapped I/O. The UNIX and STXXL plots have similar structures: time
increases in jumps at each superstep, roughly corresponding to the amount of I/O
performed in that superstep. MMap, however, differs significantly: elapsed time is
nearly flat until the final Alltoallv. This illustrates the effect of caching and the benefits
of memory mapping: the first 3 steps deal with only splitter data, thus a small amount
of data is accessed each step. This allows the cache to work effectively, keeping the
splitter data in memory and avoiding I/O. The advantage of memory-mapped I/O in
PEMS is clearly visible: these steps take almost no time at all, because no swapping
I/O is performed. The final Alltoallv, however, moves all the data to its final
location, accessing the majority of the virtual processor's memory. This data is large
and not cached, so a large amount of I/O is performed. This last step, which accesses
and moves the majority of memory on every virtual processor, means that PSRS does
not see much overall performance advantage with memory-mapped I/O. Other algorithms
which communicate in smaller chunks would see a significant improvement in overall
runtime, as the first 3 steps in Fig. 8.14 illustrate. Thus, memory mapping expands
the scope of PEMS: with explicit I/O as in PEMS1, only algorithms like PSRS with
a small number of very coarse supersteps are appropriate. PEMS2 with memory
mapped I/O, however, can see good performance with algorithms that use a large
number of supersteps and finer grained communication, since a superstep barrier no
longer forces a complete swap.
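The access pattern that favours memory mapping can be illustrated with Python's standard `mmap` module. This is a generic sketch of the technique, not PEMS code: the file name, the "context" size, and the helper function are all assumptions for illustration. A file standing in for a virtual processor context is mapped, and only a small region of it is touched, so the operating system only has to deal with the pages actually accessed.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE
CONTEXT_SIZE = 64 * PAGE          # hypothetical, deliberately tiny "context"


def touch_small_region(path, offset, payload):
    """Map the whole file, but read and write only a small slice of it."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as ctx:
            ctx[offset:offset + len(payload)] = payload
            ctx.flush()           # write modified pages back to the file


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "context.bin")
    with open(path, "wb") as f:
        f.write(b"\0" * CONTEXT_SIZE)

    touch_small_region(path, 3 * PAGE, b"splitters")

    with open(path, "rb") as f:
        f.seek(3 * PAGE)
        readback = f.read(len(b"splitters"))
        f.seek(0)
        untouched = f.read(PAGE)
```

With explicit I/O, the equivalent step would read and write the entire context regardless of how little of it was used.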
8.4 CGMLib

CGMLib/CGMGraph [1] (here collectively referred to as "CGMLib") is an
implementation of several CGM algorithms and associated utilities. CGMLib is a high-level
object based C++ library implemented on top of MPI, which implements several
communication methods:

oneToAllBCast(int source, CommObjectList &data) Broadcast the list data from
processor number source to all processors.

allToOneGather(int target, CommObjectList &data) Gather the lists data from
all processors to processor number target.

hRelation(CommObjectList &data, int *ns) Perform an h-Relation on the lists
data using the integer array ns to indicate for each processor which list objects
are to be sent to which processor.

allToAllBCast(CommObjectList &data) Every processor broadcasts its list data
to all other processors.

arrayBalancing(CommObjectList &data, int expectedN=-1) Shift the list elements
between the lists data such that every processor contains the same number of
elements.

partitionCGM(int groupId) Partition the CGM into groups indicated by groupId.
All subsequent communication operations, such as the ones listed above, operate
within the respective processor's group only.

unPartitionCGM() Undo the previous partition operation.

These communication methods are implemented using MPI collective communi- 
cation methods. All methods are supported by PEMS excluding partitionCGM and 
unPartitionCGM, which depend on the MPI_Comm_split call which is not currently 
implemented by PEMS. 

CGMLib also provides additional utilities, such as communication and compu- 
tation benchmarking, a system for routing data requests between processors, and 
commonly useful algorithms such as sorting, prefix sum, and list ranking. 
8.4.1 Sort

The sorting algorithm included in CGMLib is a simple deterministic parallel sample
sort, based on PSRS [17] and techniques described in [5]. The figures in this section
show the performance of this sort under PEMS.

Though CGMLib Sort is similar to PSRS, performance under PEMS differs
dramatically from the straightforward PSRS MPI implementation presented in §8.3.
Unfortunately, characteristics of CGMLib and PEMS interact in ways that limit the
problem size achievable for a given system. In particular, the CGMLib sort allocates
much more memory. In the context for which CGMLib was originally designed
(direct execution on a cluster using MPI) this does not significantly impact performance.
However, when explicit I/O is used in PEMS, the amount of memory allocated has
a very large impact on performance since this dictates the amount of swapping I/O
performed. The problem is amplified by the fact that the CGMLib communication
primitives typically use several MPI communication functions, which results in a
larger number of swaps each superstep.

Though n is much smaller than in the PSRS results from §8.3, because of a larger
constant factor of memory consumption, the runs in this section represent a large
problem in terms of the amount of data handled by PEMS. The largest runs reach
the limit of available disk space on our test configuration. Importantly, though n itself
is not very large from an EM perspective, the actual amount of allocated memory
used is well in excess of the total amount of available system RAM.



8.4.2 Prefix Sum 

The CGMLib Prefix Sum application finds the inclusive prefix sum of an array
distributed across all processors.

The inclusive prefix sum of an array [a₀, a₁, ..., aₙ₋₁] is the array
[a₀, a₀ + a₁, ..., a₀ + a₁ + ... + aₙ₋₁], i.e. each element in the result is the sum of
that element and all previous elements. For example, the prefix sum of [1, 2, 3, 4] is
[1, 3, 6, 10].

This application shows similar performance to CGMLib Sort. This is expected,
since both algorithms perform a small constant amount of communication.
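The definition above, and one standard way of computing it over distributed data, can be sketched as follows. The distributed version is a textbook scheme (local prefix sums, then an offset per processor from the preceding local totals) and is an assumption here; it is not taken from the CGMLib source, and the function names are hypothetical.

```python
from itertools import accumulate


def inclusive_prefix_sum(values):
    """Each output element is the sum of that element and all previous ones."""
    return list(accumulate(values))


def distributed_prefix_sum(parts):
    """Prefix sum of an array split across processors (one list per processor)."""
    # Local step: every processor computes the prefix sum of its own part.
    local = [inclusive_prefix_sum(part) for part in parts]
    # Communication step: only one total per processor is exchanged.
    totals = [part[-1] if part else 0 for part in local]
    offsets = [0] + inclusive_prefix_sum(totals)[:-1]
    # Final local step: shift each part by the sum of all earlier parts.
    return [[x + off for x in part] for part, off in zip(local, offsets)]


whole = inclusive_prefix_sum([1, 2, 3, 4])
split = distributed_prefix_sum([[1, 2], [3, 4]])
```

Only one value per processor crosses the network in this scheme, which is consistent with the small, constant amount of communication noted above.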
8.4.3 Euler Tour

The CGMLib Euler Tour application [9] finds the Euler Tour of a forest.

The Euler Tour of a graph is a path which traverses every edge of the graph exactly
once and returns to the starting point. In order to apply this problem to a tree, or a
forest (a collection of trees), each edge is doubled. Figures 8.21, 8.22, and 8.23 show
example input, transformed input, and output for this problem, respectively. The
labels on nodes in Fig. 8.23 represent the order each node is visited in the Euler Tour.



Figure 8.21: Euler Tour Input 



Figure 8.22: Euler Tour Input (Doubled Edges) 




Figure 8.23: Euler Tour Solution 
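For a single tree, doubling each edge means the tour is simply a depth-first walk that crosses every edge once in each direction. The sequential sketch below illustrates the definition only; it is not the parallel CGM algorithm of [9], the function name is hypothetical, and the input is assumed to be a valid tree given as an undirected edge list.

```python
from collections import defaultdict


def euler_tour(edges, root):
    """Return the node sequence of an Euler Tour of a tree with doubled edges."""
    adjacency = defaultdict(list)
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)

    tour = [root]
    # Iterative depth-first walk: (node, parent, iterator over neighbours).
    stack = [(root, None, iter(adjacency[root]))]
    while stack:
        node, parent, neighbours = stack[-1]
        for child in neighbours:
            if child != parent:
                tour.append(child)             # walk down the edge
                stack.append((child, node, iter(adjacency[child])))
                break
        else:
            stack.pop()
            if stack:
                tour.append(stack[-1][0])      # walk back up the edge
    return tour


tree_edges = [(0, 1), (0, 2), (1, 3)]
tour = euler_tour(tree_edges, root=0)
```

A tree with m edges yields a tour of 2m + 1 nodes that starts and ends at the root.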



This application is significantly more complex than CGMLib Sort and CGMLib 
Prefix Sum, and uses several other facilities of CGMLib (including sorting and list 



ranking). Fig. 8.24 shows the performance of the CGMLib Euler Tour application 



with memory mapped I/O. Here, n refers to the total number of trees in the forest, 
each of which contains nodes. 
8.4.4 CGMLib + PEMS Conclusions

Though the high constant factor of memory consumption prevents the CGMLib sort 
from being competitive with the simpler PSRS implementation, the results do show 
favourable scalability characteristics. It is likely that improvements to the PEMS 
memory allocator to reduce fragmentation, reductions in CGMLib's memory usage, 
and other improvements would result in a significant reduction in constant overhead 
and make PEMS+CGMLib a competitive solution. 

A positive outcome of these experiments can be seen in the results where mmap I/O
is used. The problems described above are directly related to the use of explicit I/O:
more allocated memory translates to more I/O. Memory mapped I/O, however,
avoids this problem (as described in detail in §5.2). CGMLib, then, provides an
ideal example of where the new memory mapped I/O capability of PEMS can be 
advantageous. As can be seen in the results, the CGMLib applications perform 
dramatically better with memory mapped I/O compared to Unix and STXXL I/O. 
This is because the large amount of allocated memory is not entirely used in each 
superstep, allowing the Operating System's cache to do a good job of keeping required 
data in memory across supersteps, avoiding disk I/O. This confirms that memory 
mapping is an effective strategy in PEMS for improving the performance of algorithms 
with certain characteristics. 



Chapter 9 



Conclusions 

This thesis presents PEMS2, an improved version of PEMS (Parallel External Memory 
System). PEMS can be used to execute BSP-like algorithms implemented as MPI 
programs with massive data sets larger than main memory by utilising disk. 

PEMS2 incorporates most of the future work mentioned in the literature
associated with PEMS1 [13]. In particular, PEMS2 adds multi-core support, asynchronous
I/O, and reduces disk requirements. Beyond this, a new message delivery strategy
has been introduced, and the implementation heavily reworked to a more flexible and
easy to use form, with MPI compatibility and a run-time configuration system which
allows simple experimentation with any algorithm.

Experiments show that PEMS2 performs significantly better than PEMS1,
particularly when using the full resources of modern hardware.

9.1 Future Work 

Since PEMS2 is simple to use with existing MPI code, much of the interesting work 
to be done based on this thesis is experimentation with various algorithms and con- 
figurations. It is hoped that PEMS2 will prove useful in practice to other researchers 
and practitioners interested in very large problems. 

However, there are several potential avenues of investigation related to PEMS 
itself: 

• Further MPI Compatibility: Unfortunately, many existing MPI programs are
not actually BSP or BSP-like algorithms. It may be possible to implement
non-collective communication functions (e.g. MPI_Send and MPI_Recv) in PEMS.
However, because these functions do not adhere to the superstep model (i.e.
they are not collective communication methods), implementation in PEMS may
be difficult. If possible, though, this would dramatically increase the number
of readily available PEMS-compatible programs. While algorithms that use these
functions may have negative performance implications, a great many MPI
programs undeniably use them. Such algorithms, if well designed, could be
practical EM solutions if PEMS had efficient support for non-collective
communication.

• Asynchronous and Memory-Mapped I/O: Further investigation is required to
evaluate the effectiveness of these modifications. The experiments in this thesis
do not explore the full potential of these new styles of I/O. While experiments
with CGMLib have shown memory mapping to be a useful strategy, results for
asynchronous I/O have (perhaps surprisingly) not shown much advantage.

• Dynamic α: The communication "chunk size" parameter used by Alltoallv, α,
is currently specified as a user parameter. This could be made dynamic, so that α
is as large as possible for each communication (e.g. an Alltoallv with very small
messages would only perform a single network communication).

• Fully Asynchronous Design: The lack of significant performance increases seen
when using asynchronous STXXL I/O suggests that synchronisation in PEMS limits
performance. A fully asynchronous design where both network communication
and disk delivery are handled by a separate "controller" thread could alleviate
this problem. In such a design, virtual processors would first record their
message destinations much like the current Alltoallv design in PEMS2 (see §6.2).
Rather than delivering in synchronised rounds, however, messages would be
sent over the network immediately. The controller thread on the receiving real
processor would receive the message and immediately write it to the correct
location on disk. This way, delivery of messages does not require synchronisation
between the sending and receiving virtual processors, so the overlap of
communication, computation, and I/O would be significantly increased. Such a
design would also be more appropriate for implementing non-collective
communication methods like MPI_Send and MPI_Recv.



• Multi-Core and Multi-Disk: Due to hardware limitations, the multiple-disk
capabilities of PEMS2 have not been tested. It is likely that the use of several
cores and several disks will show significant performance improvements, since
several virtual processors running at once can make full use of up to k disks (or
more, depending on the type of I/O used).

• Disk Scheduling: Recent versions of Linux include several disk scheduling
algorithms. It would be interesting to investigate the impact these have on the
performance of PEMS.



• CGMLib: The scalability shown by the CGMLib+PEMS2 experiments is
promising, but absolute performance is hindered by excessive copying and memory
consumption. Improving these characteristics of CGMLib would make the
combination of CGMLib and PEMS2 a more practical solution and, given the
impressive breadth of functionality available in CGMLib, would simplify the
development of many advanced EM algorithms with PEMS2.

• New Architectures: There are interesting similarities between disk-based models
such as those used in this thesis and increasingly popular special purpose
multi-core (e.g. the Cell BE) and General Purpose Graphics Processing Unit
(GPGPU) architectures. Both have a fast local memory store and a slower
external memory store, and transferring data between these two stores is a key
factor in performance. PEMS, with some modifications, may allow suitable
BSP algorithms to run on these new architectures with good performance. The
implementation currently contains a "mem" I/O driver which simply uses allocated
memory and does no I/O at all. This driver shows good performance
and multi-core speedup (but of course cannot scale beyond the limits of RAM).
This illustrates that PEMS2 is not inherently tied to disk I/O, and adapting
PEMS2's strategy to novel architectures is an interesting avenue for future
research.
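
As an illustration of the dynamic α idea listed above, the following Python sketch chooses α separately for each communication. It is not part of PEMS2: the function names, the max_buffer limit, and the assumption that the number of network rounds is the total message volume divided by α (rounded up) are hypothetical simplifications made only for this example.

```python
import math


def choose_alpha(message_sizes, max_buffer):
    """Pick a chunk size for one Alltoallv-style communication.

    message_sizes: sizes in bytes of the messages in this communication.
    max_buffer: hypothetical upper bound on the shared buffer, in bytes.
    """
    if max_buffer <= 0:
        raise ValueError("max_buffer must be positive")
    total = sum(message_sizes)
    # Use as much buffer as the data needs, never more than the limit,
    # and never less than one byte (so the result is a valid divisor).
    return max(1, min(total, max_buffer))


def rounds_needed(message_sizes, alpha):
    """Number of network rounds under the simplified model total / alpha."""
    total = sum(message_sizes)
    return max(1, math.ceil(total / alpha))
```

Under these assumptions, a communication whose messages fit within max_buffer completes in a single round, while a larger one falls back to the fewest rounds the buffer limit allows.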
Appendix A



Availability of PEMS2 



PEMS is freely available online at http://pems.sourceforge.net/. PEMS is Free
/ Open Source software licensed under the GNU General Public License (GPL).
Essentially this means PEMS is free to use, modify, and distribute, but all derivative
works must also be released with source code under the GPL. The PEMS2
implementation is designed to be simple to use and extend (e.g. use with applications
is trivial due to MPI compatibility, and adding new I/O drivers is straightforward).
Use, experimentation, and modification are encouraged; if licensing is an issue
please contact the authors. Contact information is available at the PEMS website.

PEMS can be compiled and installed using the typical process for UNIX software:
./configure; make; sudo make install. Run ./configure --help for a
summary of the available compile-time options which can be passed to ./configure.

Using PEMS with MPI programs is as simple as using any other MPI implemen- 
tation. No source code modifications are necessary. There are two ways of doing 
so: 

1. PEMS uses the pkg-config system for compiler and linker flags. The command
pkg-config --cflags pems2 will return the compiler flags required for
building against PEMS2, and pkg-config --libs pems2 will return the linker
flags required for linking against PEMS2.

2. Like many MPI implementations, PEMS ships with compiler wrapper scripts
to automatically add the necessary compiler and linker flags. Simply replacing
uses of mpicc and mpic++ with pemscc and pemsc++, respectively, will build an
MPI program against PEMS. Most MPI programs ship with a Makefile where
this modification can be easily made.



Appendix B 



Conventions 
B.1 Terminology

real processor A single physical computer which runs a single process (of possibly
many threads) and may have several cores which share main memory. 

virtual processor A processor in the simulated bulk-synchronous algorithm. 

thread The implementation of a virtual processor. While there is a 1 : 1 corre- 
spondence between threads and virtual processors, the two are not identical - a 
thread performs work internal to the PEMS implementation in addition to the 
work of the simulated virtual processor. 

context The memory of a virtual processor, which may exist on disk or in main 
memory depending on whether or not the virtual processor is currently swapped 
in. 

memory partition A context-sized block of real main memory into which a context
is swapped. Unlike contexts, all memory partitions fit into main memory at 
once. 

internal superstep A superstep performed internally by PEMS (e.g. as part of a
multi-superstep communication method).

virtual superstep A superstep performed by the simulated algorithm (the simula- 
tion of a virtual superstep may require several internal supersteps). 

swap The process of writing/reading an entire context to/from disk. The more 
specific terms swap in and swap out are used where necessary. 
B.2 Notation

n = Size of the problem to be solved

t = a thread ID, which is a local identifier for a thread (0 ≤ t < v/P)

p = the current virtual processor's global ID (0 ≤ p < v)




= a message sent from virtual processor i to virtual processor j, 




⌊x⌋ = x rounded down to the nearest disk block boundary

⌈x⌉ = x rounded up to the nearest disk block boundary

⌈r⌉ = the smallest block aligned region containing range r

⌊r⌋ = the largest aligned region within range r
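
The four block-alignment operations above can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, ranges are assumed to be half-open byte intervals [start, end), and the block size is passed explicitly rather than taken from any PEMS2 configuration.

```python
def align_down(x, block):
    """x rounded down to the nearest multiple of the block size."""
    return (x // block) * block


def align_up(x, block):
    """x rounded up to the nearest multiple of the block size."""
    return -(-x // block) * block


def outer_aligned(start, end, block):
    """Smallest block-aligned region containing [start, end)."""
    return align_down(start, block), align_up(end, block)


def inner_aligned(start, end, block):
    """Largest block-aligned region within [start, end).

    Returns None when the range does not contain a whole block.
    """
    lo, hi = align_up(start, block), align_down(end, block)
    return (lo, hi) if lo < hi else None
```

With an assumed 4096-byte block, the range [5000, 9000) has the outer region [4096, 12288) but no inner region, since it does not cover any complete block.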



B.3 Simulation Parameters

A PEMS simulation has the following run-time parameters: 

P = Number of real processors

μ = Memory size of a single virtual processor

D = Number of disks per real processor (D ≥ 1)

v = Total number of virtual processors (v ≥ P)

k = Number of concurrent threads per real processor (k ≤ v/P)

α = Size of the "shared buffer" in main memory

B.4 System Parameters 

For performance analysis we make use of the following variables, which are essentially 
those of the BSP* model (in lowercase) with analogous variables (in uppercase) to 
represent EM performance: 

b = Minimum size of a network message to achieve rated throughput (BSP*)

g = Time to deliver a network packet of size b, or 0 if P = 1 (BSP*)

l = Overhead of a single network superstep (BSP*)

B = Size of a single disk block (EM)

G = Time to write/read a single block of size B to/from disk (EM)

L = Overhead of a single virtual superstep (EM)

S = Time to swap a single block of size B to/from disk (EM)

L represents the constant overhead of a virtual superstep, including any
synchronisation time and swapping I/O. L may vary depending on the I/O system in use.
Note that, due to the introduction of memory-mapped I/O, this is in contrast to
previous work related to PEMS [13] [7] [6] [15] [16] where L does not include any I/O.

S is used to separate terms representing swap I/O from terms representing message
delivery I/O. S is identical to G when using explicit I/O (i.e. UNIX or STXXL), and
0 by definition when using memory mapped I/O.

g represents parallel network performance given a fully connected network, i.e.
each processor can send messages directly to each other processor. When two
processors communicate, network bandwidth between other processors is not affected.
This is true for switched ethernet networks, but not true for ethernet hubs. Hubs
are not suitable for high performance computing/networking, but this is not a
concern with modern hardware since switches have reduced in price so much that
even low-end consumer hardware is typically switched.
Appendix C



Methodology



C.1 Hardware/Software Configuration

Experiments were run on the HPCVL Beowulf cluster at Carleton University. Each
node has 2 dual-core AMD Opteron 2214 processors at 2.2 GHz with 1 MiB cache
per core, 8 GiB RAM, and a single 200 GiB disk. Nodes are interconnected with a
high-end gigabit ethernet switch.

Linux 2.6.28 was used, with the ext4 filesystem and standard I/O scheduling. All 
code was compiled with GCC 4.1.1. 



C.2 File Systems 

EM algorithms, including PEMS, attempt to optimise disk access for performance 
reasons. Locality of reference and favourable access patterns (e.g. linear sweeps) 
provide the best performance, since disk seeking is extremely expensive. However, in 
practice most applications running on a modern operating system are not actually 
accessing disk directly - disk access is provided via a file system. This has the 
implication that a linear sweep in code may not actually translate to a linear sweep 
on disk due to file system fragmentation (files are generally not guaranteed to be 
contiguous ranges of blocks). This can result in unpredictable performance. 

Thankfully, some file systems take this into consideration and provide facilities for 
allocating large areas of disk. The new default file system for Linux, ext4, includes this 
ability (support for "extents"). PEMS2 makes use of this functionality on systems 
modern enough to support this feature. All experiments in this thesis explicitly 
allocate disk via this mechanism. 



Fig. C.1 shows the potential impact a fragmented filesystem can have. In this
experiment only the amount of disk space used is increased: all other parameters,
including the problem size n, remain constant. Notice that
ext4 (with extents) has consistent performance regardless of the space used, but ext3
(without extents) degrades severely in performance as more space is used. Further
experiments have shown that, without extents, disk performance can be very
unpredictable.

^i.e. 4 cores total



Note that Fig. C.1 illustrates a particular pathological case. It
is possible (with luck) to achieve nearly equivalent performance without explicitly
allocating disk. However, unless the file system is entirely empty, this is extremely
unlikely for large files.

Those working with external memory algorithms should be careful to choose
an appropriate filesystem, and make use of explicit disk space allocation routines
(fallocate or posix_fallocate in Linux) to ensure good, predictable performance.
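
A minimal sketch of explicit disk space allocation is given below, using Python's os.posix_fallocate, which wraps the POSIX call named above. It is an illustration rather than the PEMS2 implementation: the helper name, file name, and size are hypothetical, and whether the reserved space is actually contiguous depends on the file system in use.

```python
import os
import tempfile


def preallocate(path, size):
    """Create the file if needed and reserve `size` bytes of disk for it.

    Returns the resulting file size in bytes.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Ask the file system to allocate the whole range [0, size) now,
        # instead of growing the file piecemeal as it is written.
        os.posix_fallocate(fd, 0, size)
    finally:
        os.close(fd)
    return os.stat(path).st_size


if __name__ == "__main__":
    # Hypothetical usage: reserve 1 MiB for a context file in a scratch directory.
    with tempfile.TemporaryDirectory() as scratch:
        context_file = os.path.join(scratch, "context.bin")
        print(preallocate(context_file, 1 << 20))
```

Allocating the full file before the first write gives an extent-based file system the chance to place it in large contiguous regions, which is the behaviour the experiments in this appendix rely on.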

Figure C.1: ext3 vs ext4



^Using extents actually reduces filesystem overhead in general, since blocks are much larger and 
less tree traversal is required. Thus, using extents can yield a performance improvement even on a 
completely unfragmented filesystem 

Appendix D



MPI Compatibility



Fig. D.1 shows the subset of MPI implemented by PEMS2. Additionally, malloc,
realloc, and free are wrapped by PEMS to allocate memory in the virtual processor
context rather than system RAM.

Figure D.1: Supported MPI Functions



